How to Benchmark Runtime on Cortex-M4 - arm

I'm pretty new to ARM and am trying to get timing results for functions written in C for a Cortex-M4 processor. Would any of you be able to tell me what steps I need to take to get timing results?
I've been running my code on Keil uVision, but I'm unable to use the program's Performance Analyzer during a real-environment debug. From what I've read, the Performance Analyzer only works outside of simulated debug sessions if one is using a proprietary connector from Keil.

Set a pin high at the start of the function you wish to time, set it low at the end, and measure the pulse width with an oscilloscope.
Depending on which Cortex-M4 you're using, there may be a cycle count register, DWT->CYCCNT, but its inclusion is vendor defined. Details can be found in the Cortex-M4 Technical Reference Manual. Your processor's datasheet, reference manual and programming manual should provide more information if required.
Alternatively, if you have a fast timer, such as the SysTick running from the processor clock, you could initialise the count to 0x00FFFFFF, start it downcounting at the beginning of your function and stop it at the end. You can then work out the time taken as (0x00FFFFFF - SysTick->VAL) * (1 / SysTick frequency), where SysTick->VAL is the CMSIS name for the current value register (SYST_CVR).
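As a minimal sketch of the arithmetic in the SysTick approach, the helpers below turn two readings of the down-counter into an elapsed time. The function names are hypothetical, and the code assumes the reload value is 0x00FFFFFF so that the counter period is exactly 2^24 ticks. Reading the register itself is left out so the arithmetic can be checked on a host machine.

```c
#include <assert.h>
#include <stdint.h>

/* SysTick is a 24-bit down-counter, so results are masked to 24 bits. */
#define SYSTICK_MAX 0x00FFFFFFu

/* Ticks elapsed between two reads of the current-value register.
 * The counter counts down, so start - end is the elapsed count. The mask
 * keeps the result correct across one wrap through zero, assuming the
 * reload value is 0x00FFFFFF. */
uint32_t systick_elapsed_ticks(uint32_t start, uint32_t end)
{
    return (start - end) & SYSTICK_MAX;
}

/* Convert a tick count to nanoseconds for a timer clocked at timer_hz. */
uint64_t ticks_to_ns(uint32_t ticks, uint32_t timer_hz)
{
    assert(timer_hz > 0);
    return ((uint64_t)ticks * 1000000000u) / timer_hz;
}
```

On the target, start and end would be the values read from the counter immediately before and after the function under test. The measurement is only valid if the function takes less than one full counter period, and the cost of the two reads is included in the result.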

Related

Timer supply to CPU in QEMU

I am trying to emulate the clock control for an STM32 machine with a Cortex-M4 CPU. The STM32 reference manual states that the clock supplied to the core is HCLK.
The RCC feeds the external clock of the Cortex System Timer (SysTick) with the AHB clock (HCLK) divided by 8. The SysTick can work either with this clock or with the Cortex clock (HCLK), configurable in the SysTick control and status register.
Now, the Cortex-M4 is already emulated by QEMU and I am using it for the STM32 emulation. My confusion is: should I supply the "HCLK" clock frequency that I have developed for the STM32 to send clock pulses to the Cortex-M4, or does the Cortex-M4 itself manage its own clock at the HCLK frequency of 168MHz? Or is the clock frequency different?
If I have to pass this frequency to the Cortex-M4, how do I do that?
QEMU's emulation does not generally try to emulate actual clock lines which send pulses at megahertz rates (this would be incredibly inefficient). Instead when the guest programs a timer device the model of the timer device sets up an internal QEMU timer to fire after the appropriate duration (and the handler for that then raises the interrupt line or does whatever is necessary for emulating the hardware behaviour). The duration is calculated from the values the guest has written to the device registers together with a value for what the clock frequency should be.
QEMU doesn't have any infrastructure for handling things like programmable clock dividers or a "clock tree" that routes clock signals around the SoC (one could be added, but nobody has got around to it yet). Instead timer devices are usually either written with a hard-coded frequency, or may be written to have a QOM property that allows the frequency to be set by the board or SoC model code that creates them.
In particular for the SysTick device in the Cortex-M models the current implementation will program the QEMU timer it uses with durations corresponding to a frequency of:
1MHz, if the guest has set the CLKSOURCE bit to 1 (processor clock)
something which the board model has configured via the 'system_clock_scale' global variable (eg 25MHz for the mps2 boards), if the guest has set CLKSOURCE to 0 (external reference clock)
(The system_clock_scale global should be set to NANOSECONDS_PER_SECOND / clk_frq_in_hz.)
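As a worked example of that formula, the sketch below computes the scale value for a few frequencies. The helper name clock_scale_ns is hypothetical and is not part of QEMU; only the constant mirrors the name used in the answer.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds in one second, named as in the formula above. */
#define NANOSECONDS_PER_SECOND 1000000000LL

/* Nanoseconds per timer tick for a clock of clk_frq_in_hz.
 * Integer division truncates, so a frequency that does not divide 1e9
 * evenly (168 MHz, for example) cannot be represented exactly. */
int64_t clock_scale_ns(int64_t clk_frq_in_hz)
{
    assert(clk_frq_in_hz > 0);
    return NANOSECONDS_PER_SECOND / clk_frq_in_hz;
}
```

For the 25MHz mps2 case this gives 40 ns per tick. For a 168MHz HCLK the exact period is about 5.95 ns, which truncates to 5, so a nanosecond-granularity scale can only approximate such a clock.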
The 1MHz is just a silly hardcoded value that nobody has yet bothered to improve upon, because we haven't run into guest code that cares yet. The system_clock_scale global is clunky but works.
None of this affects the speed of the emulated QEMU CPU (ie how many instructions it executes in a given time period). By default QEMU CPUs will run "as fast as possible". You can use the -icount option to specify that you want the CPU to run at a particular rate relative to real time, which sort of implicitly sets the 'cpu frequency', but this will only sort of roughly set an average -- some instructions will run much faster than others, in a not very predictable way. In general QEMU's philosophy is "run guest code as fast as we can", and we don't make any attempt at anything approaching cycle-accurate or otherwise tightly timed emulation.
Update as of 2020: QEMU now has some API and infrastructure for modelling clock trees, which is documented in docs/devel/clocks.rst in the source tree. This is basically a formalized version of the concepts described above, to make it easier for one device to tell another "my clock rate is 20MHz now" without hacks like the "system_clock_scale" global variable or ad-hoc QOM properties.
SysTick is clocked through a multiplexer: you can choose either the AHB bus clock (HCLK) or HCLK divided by 8.
This is an old thread and an oft-asked question, so this should help some of you trying to emulate Cortex systems.
If you boot with a .dtb, you can add a line clock-frequency = <value>; to the 'timers' block of your .dts and recompile it. This will indeed increase the speed of Cortex processors. Clearly, value is some large number.

Which MCU(Cortex-M) for time critical GPIO application?

We have an application which runs on a PIC24H, and we would like to port it to another MCU, preferably an ARM Cortex. The application is extremely time critical, meaning that we need extremely deterministic code behaviour. In short, pulses arrive on GPIO pins via special hardware and the data is analyzed right away. Processing the data is not complex (we don't need a beefy CPU/MCU to do it). After the data is analyzed, the GPIO output pins are written to their values.
App in 3 short lines:
process input pins
determine pattern within processing of input pins
based on the received pattern write output pins
The PIC24H is working at 40MHz and we can toggle a pin in 25ns; we would be grateful for at least 2x speed for future upgrades. So an MCU which can run deterministic code and toggle pins at 80MHz or more (12.5ns) would be just fine. We don't need to toggle the pins at a constant fast rate; we need an MCU which can toggle a pin in less than 25ns. We can't waste cycles while toggling: if one cycle is off we lose synchronization. Everything must be done with one-cycle precision (or two, but a constant two cycles), so the code should be 100% deterministic.
Please let me know if I'm missing something or if what we need can be done using some other method on Cortex-M. Just keep in mind that if one cycle is lost (due to cache or similar) we lose signal sync and the app will not do its work right, or at all.
Thanks!
Br
According to this blog post, the interrupt latency for Cortex-M ranges from 12 to 16 cycles (assuming you are not using FPU registers) with best-case memories. M0 and M0+ are slower than M3/M4/M7. On top of this, you need to add the GPIO access times (and watch out for different clock frequencies between the core and the peripherals). Cortex-M7 will support higher clock speeds than M3/M4.
It still isn't clear how many cycles are consumed in recognising a pattern, and how an interrupt is useful in doing this - generally a low latency interface function like this would be an obvious target for dedicated hardware, but since you have an existing software solution it seems the problem is mis-specified.
Providing you avoid accessing any 'slow' peripherals which might stall the bus, the interrupt latency should be deterministic - any specific device should have documentation which covers this.
NXP have an application note which describes some of the detail of how to measure what is going on.

how to calculate machine cycle time of given micro controller unit

I am working on an 8051 MCU from Silicon Labs. I want to generate an exact 1ms delay using a timer. For this I want to know the machine cycle time of the MCU, that is, the time taken by the MCU to complete one machine instruction. Then I can calculate how many machine cycles are needed for a 1ms delay.
Creating a time delay by counting MCU cycles is a poor method - especially if you are coding in C where you have no control over the machine instructions the compiler will generate - your loop will likely change depending on compiler options such as optimisation level.
Moreover the MCU has no means of measuring its own clock; its only concept of time passing is in clock-cycle units - asking it how long a cycle is is rather like asking a human how long a second is. The answer to the question of how long a clock-cycle is from the point of view of the MCU is always 1.
As the programmer of the system, it is your responsibility to know the clock speed. Typically the hardware defines the speed by its crystal or oscillator rate, and the MCU PLL settings determine the multiplier. Most often you will embed this speed as a constant in the start-up code; your code might access this constant.
Even then, you are better off creating delays using an on-chip timer unit rather than software-based instruction counting (and not all 8051 instructions are single cycle). In that case, you still need to know the clock speed; then the timer clock may be further divided from that.
To use the timer you need to know the frequency of the timer clock. Then you just need to calculate: timer_clocks = delay * frequency;
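The calculation above can be sketched as follows. The function names are hypothetical, and the reload helper assumes a 16-bit up-counting timer that overflows at 65536; the timer modes on a particular 8051 variant may differ, so check the datasheet.

```c
#include <assert.h>
#include <stdint.h>

/* timer_clocks = delay * frequency, with the delay given in microseconds.
 * The product is formed in 64 bits to avoid overflow before the division. */
uint32_t delay_us_to_ticks(uint32_t delay_us, uint32_t timer_hz)
{
    return (uint32_t)(((uint64_t)delay_us * timer_hz) / 1000000u);
}

/* Reload value for a 16-bit up-counting timer that overflows at 65536
 * (an assumption about the timer mode, not a property of every 8051). */
uint16_t reload_for_ticks(uint32_t ticks)
{
    assert(ticks >= 1u && ticks <= 65536u);
    return (uint16_t)(65536u - ticks);
}
```

For example, with an assumed 2 MHz timer clock a 1 ms delay needs 2000 timer clocks, which corresponds to a reload value of 63536 (0xF830) under the 16-bit up-counter assumption.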
You need to know instruction timings only if you want a blocking delay. There are two sources: the uC documentation or experiment. To find out how many loops you need, just connect an oscilloscope to a pin and loop as many times as needed to achieve the required pulse length.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds, 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I just want to wait that amount of time in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?
Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that requires 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.
The chip you refer to has PWM (pulse width modulation) hardware declared as one of its major features. You should rely on this. Please refer to the appropriate application guide. Generally you cannot guarantee 12.5 us periods from the application layer (and should not try to do so). Even if you managed to do so directly from the application layer, it's a bad idea. Any change in your firmware code can break this.
If you use a timer peripheral with PWM output capability as suggested by @RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However if you process interrupts at an interval of 12.5us you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer event will probably be faster than the latency on an interrupt handler.
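A minimal sketch of polling a reload flag might look like the code below. The register layout is hypothetical: it assumes a status word in which bit 0 is set by hardware on each timer reload and is cleared by writing the bit back as 0. The real flag name, location and acknowledge mechanism come from the device's timer documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 of the status word is set on timer reload. */
#define TIMER_RELOAD_FLAG 0x1u

/* Spin until the timer reports a reload, then acknowledge the flag.
 * The pointer is volatile so the compiler re-reads the register on every
 * pass instead of optimising the loop away. */
void wait_for_reload(volatile uint32_t *status)
{
    assert(status != NULL);
    while ((*status & TIMER_RELOAD_FLAG) == 0u) {
        /* short, bounded work can be done here between polls */
    }
    *status &= ~TIMER_RELOAD_FLAG; /* acknowledge; real devices differ */
}
```

With the timer period set to 12.5 us, toggling the output once after each call would give the requested pulse train, and the timing would come from the timer rather than from the number of instructions in the loop.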

Entering sleep mode on arm cortex m4

I'm trying to put a cortex m4 processor to sleep for a little less than a second. I want to be able to tell it to sleep, then a second later, or when a button is pressed, pick up right where I left off. I've looked in the reference manual and VLPS mode looks like it would fit my needs. I don't know how to begin to enter that mode or how to program the NVIC.
More Info:
I am doing this in C, on the bare metal.
You can download and inspect the code that implements this demo. Although the demo is for an RTOS the code used to place the CPU into a sleep mode is the same whether an RTOS is being used or the application is running on bare metal.
There are generic things you can do to place a Cortex-M3 core into a low power state (see the WFI instruction). To get extreme low power then you have to do chip specific things as well. The above linked code performs some chip specific pre-sleep processing (turn off peripherals, set the chip's own sleep mode, etc.) before calling WFI, then does some chip specific things when it returns from the WFI instruction.
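The generic part of that sequence can be sketched as below. This is not the linked demo's code. The function names are hypothetical, the register is passed in as a pointer so the logic can be exercised off-target, and the chip-specific steps (such as selecting VLPS) are left as comments because they depend on the vendor's power-management registers. The SLEEPDEEP bit position is the one defined for the System Control Register on Cortex-M3/M4.

```c
#include <assert.h>
#include <stdint.h>

/* SLEEPDEEP is bit 2 of the System Control Register (SCB->SCR in CMSIS). */
#define SCR_SLEEPDEEP (1u << 2)

/* Return the SCR value with SLEEPDEEP set or cleared, other bits kept. */
uint32_t scr_select_sleep(uint32_t scr, int deep)
{
    return deep ? (scr | SCR_SLEEPDEEP) : (scr & ~SCR_SLEEPDEEP);
}

/* Halt the core until an interrupt arrives. On a non-Cortex-M build this
 * is a no-op so the surrounding logic can be tested on a host. */
static inline void wait_for_interrupt(void)
{
#if defined(__arm__) && defined(__ARM_ARCH_PROFILE) && (__ARM_ARCH_PROFILE == 'M')
    __asm__ volatile ("wfi" ::: "memory");
#endif
}

/* Prepare, select the sleep depth, wait, then restore. */
void enter_sleep(volatile uint32_t *scr, int deep)
{
    assert(scr != 0);
    /* chip-specific pre-sleep work (peripherals off, power mode) goes here */
    *scr = scr_select_sleep(*scr, deep);
    wait_for_interrupt();
    /* execution resumes here after the wake-up interrupt has been handled;
     * chip-specific wake-up work goes here */
}
```

On the target, the pointer would be the address of the System Control Register, and a timer or GPIO interrupt would need to be enabled beforehand so that something can wake the core.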
You don't need an RTOS to wake a Cortex-M4 from sleep; what you need is an interrupt (ISR). You should refer to the manufacturer's manual: you may wake up with a timer (ISR) or a button (GPIO), depending on the sleep/hibernation modes of your particular chip. Here is an ARM document that goes into more depth about it.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/BABGGICD.html

Resources