Unexpected timing in Arm assembly

Unexpected timing in Arm assembly - c

I need a very precise timing, so I wrote some assembly code (for ARM M0+).
However, the timing is not what I expected when measuring on an oscilliscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
// note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
__asm volatile(
".syntax unified \t\n" /* we need unified to use subs (not for sub, though) */
"0: \t\n"
"subs %[cyc], #1 \t\n" /* assume cycles > 0 */
"cmp %[cyc], #0 \t\n"
"bne.n 0b\t\n" /* this instruction costs 2 cycles when branching! */
: [cyc]"+r" (loopCnt) /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
: /* input specified in output */
: /* no clobbers */
);
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting high again) timing is consistently 50% higher than expected. I tried for low values (1 us giving 1.56 us), up to 500 ms giving 750 ms.
I tried to single step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). Paranthesis is number of expected clock cycles.
Can anybody shed light on what is going on here?

After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since the subs has the s flag set. If doing so, remember to set instruction count to 3 instead of 4. This will give you a better time resolution.

You can't use these processors like you would a PIC, the timing doesn't work like that. I have demonstrated this here many times you can look around, maybe will do it again here, but not right now.
First off these are pipelined so you average performance is one thing, but and once in a loop and things like caching and branch prediction learning and other factors have settled then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor now matter how shallow, that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction, which didn't show up in the arm docs as arm doing it but the chip vendors are fully free to do this.
Caching. While a cortex-m0+ if this is an STM32 or perhaps other brands as well, there is or may be a cache you can't turn off. Not uncommon for the flash to be half the speed of the processor thus flash wait state settings, but often the zero wait state means zero additional and it takes two clocks to get one fetch done or at least is measurable that execution in flash is half the speed of execution in ram with all other settings the same (system clock speed, etc). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent who knows. And rarely can you turn this off or defeat it so the first time through or the time entering the loop can see a delay and technically a pre-fetcher can slow down the loop (see alignment).
Flash, as mentioned depending on the chip vendor and age of the part it is quite common for the flash to be half speed of the core. And then depending on your clock rates, when you read about the flash settings in the chip doc where it shows what the required wait states were relative to system clock speed that is a key performance indicator both for the flash technology and whether or not you should really be raising the system clock up too high, the flash doesn't get any faster it has a speed limit, sram from my experience can keep up and so far I don't see them with wait states, but flashes used to be two or three settings across the range of clock speeds the part supports, the newer released parts the flashes are tending to cover the whole range for the slower cores like the m0+ but the m7 and such keep getting higher clock rates so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an rtos, are there interrupts going on are you increasing and/or guaranteeing that this gets interrupted with a longer delay?
Peripheral timing, the peripherals are not expected to respond to a load or store in a single clock they can take as long as they want and depending on the clocking system and chip vendors IP, in house or purchased, the peripheral might not run at the processor clock rate and be running at a divided rate making things slower. Your code no doubt is calling this function for a delay, and then outside this timing loop you are wiggling a gpio pin to see something on a scope which leads to how you conducted your benchmark and additional problems with that based on factors above and this one.
And other factors I have to remember.
Like high end processors like the x86, full sized ARMs, etc the processor no longer determines performance. The chip and motherboard can/do. You basically cannot feed the pipe constantly there are stalls going on all over the place. Dram is slow thus layers of caching trying to deal with it but caching helps sometimes and hurts others, branch predictors hurt as much as they help. And so on but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy. Ideally using the width of the bus rather than the size of the instruction, transaction overhead so multiple widths of the bus is even more ideal that one width, etc.
Causing tight loops like this on any core to have a jerky motion and or be inconsistent in timing when the same machine code is used at different alignments. Now granted for size/power/etc the m0+ has a tiny pipe, but it still should show the affects of this. These are not pics or avrs or msp430s no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit banging where you need to be greater than or equal to some time value, but if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort and you have this maintenance issue now or readability or understandability of the code.
So bottom line there is no reason to expect consistent timing. If you happened to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and re-built the code to use a different value for the loop that it didn't affect alignment of this loop.
You show a loop like this
loop:
subs r0,#1
cmp r0,#0
bne loop
on a tangent why the cmp, why not just
loop:
subs r0,#1
bne loop
But second you then claim to be measuring this on a scope, which is good because how you measure things plays into the quality of the benchmark often the benchmark varies because of how it is measured the yardstick is the problem not the thing being measured, or you have problems with both then the measurement is much more inconsistent. Had you used systick or other to measure this depending on how you did that the measurement itself can cause the variation, and even if you used gpio to toggle a pin that can and probably is affecting this as well. All other things held constant simply changing the loop count depending on the immediate and the value used could push you between a thumb and thumb2 instruction changing the alignment of some loop.
What you have shown implies you have this timing loop which can be affected by a number of system issues, then you have wrapped that with some other loop itself being affected, plus possibly a call to a gpio library function which can be affected by these factors as well from a performance perspective. Using inline assembly and the style in which you wrote this function that you posted implies you have exposed yourself and can easily see a wide range of performance differences in running what appears to be the same code, or even actually the code under test being the same machine code.
Unless this is a microchip PIC, not PIC32, or a very very short list of other specific brand and family of chips. Ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware, if for example you are trying to use the ws8212/neopixel leds and you have a tight window for timing you are not going to be successful or will have limited success using instruction timing. In that specific case you can sometimes get away with using a spi controller or timers in the part to generate accurately timed (far more than you can ever do with software timers managing the bit banging or otherwise). With a PIC I was able to generate tv infrared signals with the carrier frequency and ons and off using timed loops and nops to make a highly accurate signal. I repeated that for one of these programmable led things for a short number of them on a cortex-m using a long linear list of instructions and relying on execution performance it worked but was extremely limited as it was compile time and quick and dirty. SPI controllers are a pain compared to bit banging but another evening with the SPI controller and could send any length of highly accurately timed signals out.
You need to change your focus to using timers and/or on chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal this is you are trying to generate. Leave timed loops or even timer based loops wrapped by other loops for the greater than or equal cases and not for the within a range of time cases. If unable to do it with one chip, then look around at others, very often when making a product you have to shop for the components, across vendors, etc. Push comes to shove use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals, they do i2c, spi, jtag, swd etc with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip and I have a lot of different chips/boards handy but only a small fraction of what is out there so if I wanted to do a demo it might not be worth it, if mine has the core compiled one way I might not be able to get it to demonstrate a variation where the same exact core from arm compiled another way on another chip might be easy. I suspect first off a lot of your variation is because you are making calls within a bigger loop, call to delay call to state change the gpio, and you are recompiling that for the experiments. Or worse as shown in your question if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

Related

How to use perf or other utilities to measure the elapsed time of a loop/function

I am working on a project requiring profiling the target applications at first.
What I want to know is the exact time consumed by a loop body/function. The platform is BeagleBone Black board with Debian OS and installed perf_4.9.
gettimeofday() can provide a micro-second resolution but I still want more accurate results. It looks perf can give cycles statistics and thus be a good fit for purposes. However, perf can only analyze the whole application instead of individual loop/functions.
After trying the instructions posted in this Using perf probe to monitor performance stats during a particular function, it does not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!

Caveat: This is more of comment than an answer but it's a bit too long for just a comment.
Thanks a lot for advising a new function. I tried that but get a little unsure about its accuracy. Yes, it can offer nanosecond resolution but there is inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body, the approximate elasped time would be around 1.4us in this way. Then I put GPIO instructions, pull high and pull down, at beginning and end of the loop body, respectively and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative. That is, clock_gettime is ultimately based on number of highres clock ticks (at 1ns/tick or better from the system's free running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get X duration. Then, you change some code and get X+n, this tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although the clock_gettime is technically a syscall, linux will map the code directly into the app's code via the VDSO mechanism so there is no syscall overhead. But, even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns. On another system, the overhead is 1800 ns. Both these systems are at least 2GHz+
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For arm, it has a similar H/W register but requires special care to map userspace access to it (a coprocessor instruction, IIRC).
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust TSC [or software in general ;-)] and was trying to use a scope, an LED [controlled by a GPIO pin] and when the LED flash/pulse showed up inside the video frame (e.g. the coordinates of the white dot in the video frame were [effectively] a time measurement).
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

Which MCU(Cortex-M) for time critical GPIO application?

We have an application which runs on PIC24H, we would like to port it to another MCU, preferably ARM Cortex. Application is extremely time critical, meaning that we need extremely deterministic code behaviour. In short, there are pulses which are obtained via special hardware to GPIO pins, data is analyzed right away. Processing of data is not complex(we don't need a beefy cpu/mcu to do it). After analyzing the data GPIO output pins are written to their values.
App in 3 short lines:
process input pins
determine pattern within processing of input pins
based on the received pattern write output pins
PIC24H is working at 40MHz, we can toggle the pin in 25ns, we would be grateful with at least 2x speed for future upgrades. So MCU which can run deterministic code and toggle pins with at least 80MHz (12.5ns) would be just fine. We don't need toggling of the pins at constant fast rate, we need a mcu which can toggle it in less than 25ns. We can't waste cycles while toggling, if one cycle is off we loose synchronization. Everything must be done in one cycle precision(or two but constant two cycles), so code should be 100% deterministic.
Please let me know if I'm missing something or if what we need can be done using some other methods on Cortex-M. Just keep in mind that if one cycle is lost(due cache or similar) we loose signal sync and app will not do it's work right or at all.
Thanks!
Br

According to this blog post, the interrupt latency for Cortex-M ranges from 12 to 16 cycles (assuming you are not using FPU registers) with best-case memories. M0 and M0+ are slower than M3/M4/M7. On top of this, you need to add the GPIO access times (and watch out for different clock frequencies between the core and the peripherals. Cortex-M7 will suppport higher clock speeds than M3/M4.
It still isn't clear how many cycles are consumed in recognising a pattern, and how an interrupt is useful in doing this - generally a low latency interface function like this would be an obvious target for dedicated hardware, but since you have an existing software solution it seems the problem is mis-specified.
Providing you avoid accessing any 'slow' peripherals which might stall the bus, the interrupt latency should be deterministic - any specific device should have documentation which covers this.
NXP have an application note which describes some of the detail of how to measure what is going on.

Repeat an instruction a certain number of times without a "loop"?

I've used PICs before and now I'm using with STM32F415.
On a time-critical part of my code I need to put a very exact delay to adjust the period of the DAC-DMA that are working together to create a periodic analog signal.
The delay I want to add goes from 0 to 63 clock cycles (If I were able to do 10-63 clock cycles it would be OK aswell). In PIC24F assembly, there's the instruction "REPEAT" which allows me to repeat the next instruction a certain number of times. That would work great for me as I'd be able to just do:
REPEAT #0xNUMBER
NOP
I'm trying to find something similar with the STM32F4, but I had no luck searching in the instruction set, Reference Manual, and on the Internet in general.
I've already tried to use for/while loops in C and a timer dedicated to it, but the extra instructiosn required consume too much time (40-50 cycles depending on the way I program it).
If someone has an idea or knows how to do it, It would be very useful for me.
Thanks a lot.
English is not my mother-tongue language so I'm sorry for any possible mistakes. Let me know and I'll try to improve it :)
EDIT 1 (23-jul-17)
Thanks to everyone answering, I've been very busy and couldn't answer every one of you individually.
I'll try #berendi solution of gated clocks, it seems as the best fit for my application.
I'm learning a lot of things about the STM32 I didn't know, thank you everyone!

I stop the timer that makes de DMA-DAC work, do the delay and then
enable it again.
So, if I'm understanding it correctly, you have Timer A controlling your DAC, triggering a conversion at each counter overflow, and you'd like to delay it for a variable number of clock cycles.
Most (if not all) timers of the STM32F4 support gated mode slave operation, where you can select another timer (Timer B) as a master, and Timer A counts only as long as the trigger output of Timer B is low. In other words, Timer A will stop counting on a rising edge from Timer B, and resume counting on a falling edge. Now, configure Timer B to output a single pulse when enabled, where the pulse width is the delay you want, then Timer A will be delayed for the exact duration of the pulse.
See the chapters on One-pulse mode, Timers and external trigger synchronization, and the description of the CR1, CR2, and SMCR registers in the reference manual.

NOP is not a very good solution for the delay. Use barrier instructions instead as the execution time is exactly as stated in the ARM documentation (3, 4 or 5 cycles depending what instruction and the core version). You can place n consecutive barriers to archive the delay you need

On a PIC you can do this and it is a very common solution, the execution time was deterministic. Outside architectures like that, and older chips that were also deterministic (before clones came out) that would be okay as well. But in general that is not how you do a delay, it is not deterministic, you can get an "at least this long" for a tuned loop, but you cant get "exactly this long" even tuned, or should never expect to. That is why there are timers, multiple usually, in mcu designs and that is what you use for measuring time. For the problem you are trying to solve that is the solution here, one timer or cascaded timers if you really need that.
Arm does not have an x86 like repeat instruction your smallest loop is going to be two instructions and I have countless times demonstrated that on the same chip this loop can vary in speed, so tune it, add a line of code and the delay properties of this loop change
here:
sub r0,#1
bne here
for classic (gas) syntax, for unified syntax use subs instead of sub.
You are also on an stm32 where they have a buried cache on the instruction side that you cannot turn off nor control, it generally gives you no wait state performance, certainly for things like this but obviously they dont have a cache the size of the flash so pre-fetch cycles have to happen somewhere, and you have to expect that sometimes you are going to have to feel that prefetch when you jump into this loop.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of times in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?

Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.

The chip you refer has PWM (pulse width modulation) hardware declared as one of major winning features. You should rely on this. Please refer to appropriate application guide. Generally you cannot guarantee 12.5uS periods from application layer (and should not try to do so). Even if you managed to do so directly from application layer it's bad idea. Any change in your firmware code can break this.

If you use a timer peripheral with PWM output capability as suggested by #RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However if you process interrupts at an interval of 12.5us you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer even will probably be faster than the latency on an interrupt handler.

ARM926EJ-S cycle-counter

Im using an ARM926EJ-S and am trying to figure out whether the ARM can give (e.g. a readable register) the CPU's cycle-counter. I guess a # that will represent the number of cycles since the CPU has been powered.
In my system i have only Low-Res external RTC/Timers. I would like to be able to achieve a Hi-Res timer.
Many thanks in advance!

You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an over head in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer running sections of code, which may be of more practical use than cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.

The arm-9 is not equipped with an PMU (Performance Monitoring Unit) as included in the Cortex-family. The PMU is described here. The linux kernel comes equipped with support for using the PMU for benchmarking performance. See here for documentation of the perf tool-set.
Bit unsure about the arm-9, need to dig a bit more...

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight