HCS12 embedded: counter timer and calculated output compare values - timer

I'm having trouble with timer output compare interrupts on the HCS12. The problem seems to be that I'm writing calculated values to the output compare registers rather than immediates, ie...
OCval = x + y ; ldd OC1, OCval ; // what I need to do
ldd OC1, #3000 ; // what works
With the calculated values, the timer interrupt is erratic, which is unacceptable in my application. The problem has been firmly pinned down to the documented requirement to access the timer and OC registers in a single cycle, anything other than an immediate write violates this. I also note that all the sample code on the web uses immediate ops.
Just wondering if there is a software workaround. I need to allow the counter to free run (ie. no reset) because there are other output compares with immediate writes that have to remain in action. Only two of my interrupts need to be calculated.
A software fix would be nice because the only other options I can see involve additional hardware to handle the dynamic timing, messy. TIA

This is a bit tentative, but early tests are encouraging. I've moved the offending interrupts from the main timer to the modulo downcounter, which also provides clocked interrupts. The documentation states that setting the count register is subject to the same single-cycle write rules, however my extensive testing with the main timer indicates that problems are very unlikely to occur until the counter has run for a while with the new setting. The advantage of the new approach is that a value needs to be written only once, on initially setting the time value, unlike the main timer where a rewrite needed to be performed many times a second.
In case it helps, I'm stopping the counter before doing the write, then restarting it.

Related

Calculate MCU load (or free) time during operation

I have a Cortex M0+ chip (STM32 brand) and I want to calculate the load (or free) time. The M0+ doesn't have the DWT->SYSCNT register, so using that isn't an option.
Here's my idea:
Using a scheduler I have, I take a counter and increment it by 1 in my idle loop.
uint32_t counter = 0;
while(1){
sched_run();
}
sched_run(){
if( Jobs_timer_ready(jobs) ){
// do timed jobs
}else{
sched_idle();
}
}
sched_idle(){
counter += 1;
}
I have jobs running on a 50us timer, so I can collect the count every 100ms accurately. With a 64mhz chip, that would give me 64000000 instructions/sec or 64 instructions/usec.
If I take the number of instructions the counter uses and remove that from the total instructions per 100ms, I should have a concept of my load time (or free time). I'm slow at math, but that should be 6,400,000 instructions per 100ms. I haven't actually looked at the instructions that would take but lets be generous and say it takes 7 instructions to increment the counter, just to illustrate the process.
So, let's say the counter variable has ended up with 12,475 after 100ms. Our formula should be [CPU Free %] = Free Time/Max Time = COUNT*COUNT_INSTRUC/MAX_INSTRUC.
This comes out to 12475 * 7/6,400,000 = 87,325/6,400,00 = 0.013644 (x 100) = 1.36% Free (and this is where my math looks very wrong).
My goal is to have a mostly-accurate load percentage that can be calculated in the field. Especially if I hand it off to someone else, or need to check how it's performing. I can't always reproduce field conditions on a bench.
My basic questions are this:
How do I determine load or free?
Can I calculate load/free like a task manager (overall)?
Do I need a scheduler for it or just a timer?
I would recommend to set up a timer to count in 1 us step (or whatever resolution you need). Then just read the counter value before and after the work to get the duration.
Given your simplified program, it looks like you just have a while loop and a flag which tells you when some work needs to be done. So you could do something like this:
uint32_t busy_time = 0;
uint32_t idle_time = 0;
uint32_t idle_start = 0;
while (1) {
// Initialize the idle start timer.
idle_start = TIM2->CNT;
sched_run();
}
void sched_run()
{
if (Jobs_timer_ready(jobs)) {
// When the job starts, calculate the duration of the idle period.
idle_time += TIM2->CNT - idle_start;
// Measure the work duration.
uint32_t job_started = TIM2->CNT;
// do timed jobs
busy_time += TIM2->CNT - job_start;
// Restart idle period.
idle_start = TIM2->CNT;
}
}
The load percentage would be (busy_time / (busy_time + idle_time)) * 100.
Counting cycles isn't as easy as it seems. Reading variable from RAM, modifying it and writing it back has non-deterministic duration. RAM read is typically 2 cycles, but can be 3, depending on many things, including how congested AXIM-bus is (other MCU peripherals are also attached to it). Writing is a whole another story. There are bufferable writes, non-bufferable writes, etc etc. Also, there is caching, which changes things depending on where the executed code is, where the data it's modifying is, and cache policies for data and instructions. There is also an issue of what exactly your compiler generates. So this problem should be approached from a different angle.
I agree to #Armandas that the best solution is a hardware timer. You don't even have to set it up to a microsecond or anything (but you totally can). You can choose when to reset counter. Even if it runs at CPU clock or close to that, 32-bit overflow will take very long (but still must be handled; I would reset the timer counter when I make idle/busy calculation, seems like a reasonable moment to do that; if your program can actually overflow the timer at runtime, you need to come up with modified solution to account for it of course). Obviously, if your timer has 16-bit prescaler and counter, you will have to adjust for that. Microsecond tick seems like a reasonable compromise for your application after all.
Alternative things to consider: DTCM memory - small tightly coupled RAM - has actually strictly one cycle read/write access, it's by definition not cacheable and not bufferable. So with tightly coupled memory and tight control over what exactly instructions are being generated by compiler and executed by CPU, you can do something more deterministic with variable counter. However, if that code is ported to M7, there may be timing-related issues there because of M7's dual issue pipeline (if very simplified, it can execute 2 instructions in parallel at a time, more in Architecture Reference Manual). Just bear this in mind. It becomes a little more architecture dependent. It may or may not be an issue for you.
At the end of the day, I vote stick with hardware timer. Making it work with variable is a huge headache, and you really need to get down to architecture level to make it work properly, and even then there could always be something you forgot/didn't think about. Seems like massive overcomplication for the task at hand. Hardware timer is the boss.

Unexpected timing in Arm assembly

I need a very precise timing, so I wrote some assembly code (for ARM M0+).
However, the timing is not what I expected when measuring on an oscilliscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
// note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
__asm volatile(
".syntax unified \t\n" /* we need unified to use subs (not for sub, though) */
"0: \t\n"
"subs %[cyc], #1 \t\n" /* assume cycles > 0 */
"cmp %[cyc], #0 \t\n"
"bne.n 0b\t\n" /* this instruction costs 2 cycles when branching! */
: [cyc]"+r" (loopCnt) /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
: /* input specified in output */
: /* no clobbers */
);
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting high again) timing is consistently 50% higher than expected. I tried for low values (1 us giving 1.56 us), up to 500 ms giving 750 ms.
I tried to single step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). Paranthesis is number of expected clock cycles.
Can anybody shed light on what is going on here?
After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since the subs has the s flag set. If doing so, remember to set instruction count to 3 instead of 4. This will give you a better time resolution.
You can't use these processors like you would a PIC, the timing doesn't work like that. I have demonstrated this here many times you can look around, maybe will do it again here, but not right now.
First off these are pipelined so you average performance is one thing, but and once in a loop and things like caching and branch prediction learning and other factors have settled then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor now matter how shallow, that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction, which didn't show up in the arm docs as arm doing it but the chip vendors are fully free to do this.
Caching. While a cortex-m0+ if this is an STM32 or perhaps other brands as well, there is or may be a cache you can't turn off. Not uncommon for the flash to be half the speed of the processor thus flash wait state settings, but often the zero wait state means zero additional and it takes two clocks to get one fetch done or at least is measurable that execution in flash is half the speed of execution in ram with all other settings the same (system clock speed, etc). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent who knows. And rarely can you turn this off or defeat it so the first time through or the time entering the loop can see a delay and technically a pre-fetcher can slow down the loop (see alignment).
Flash, as mentioned depending on the chip vendor and age of the part it is quite common for the flash to be half speed of the core. And then depending on your clock rates, when you read about the flash settings in the chip doc where it shows what the required wait states were relative to system clock speed that is a key performance indicator both for the flash technology and whether or not you should really be raising the system clock up too high, the flash doesn't get any faster it has a speed limit, sram from my experience can keep up and so far I don't see them with wait states, but flashes used to be two or three settings across the range of clock speeds the part supports, the newer released parts the flashes are tending to cover the whole range for the slower cores like the m0+ but the m7 and such keep getting higher clock rates so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an rtos, are there interrupts going on are you increasing and/or guaranteeing that this gets interrupted with a longer delay?
Peripheral timing, the peripherals are not expected to respond to a load or store in a single clock they can take as long as they want and depending on the clocking system and chip vendors IP, in house or purchased, the peripheral might not run at the processor clock rate and be running at a divided rate making things slower. Your code no doubt is calling this function for a delay, and then outside this timing loop you are wiggling a gpio pin to see something on a scope which leads to how you conducted your benchmark and additional problems with that based on factors above and this one.
And other factors I have to remember.
Like high end processors like the x86, full sized ARMs, etc the processor no longer determines performance. The chip and motherboard can/do. You basically cannot feed the pipe constantly there are stalls going on all over the place. Dram is slow thus layers of caching trying to deal with it but caching helps sometimes and hurts others, branch predictors hurt as much as they help. And so on but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy. Ideally using the width of the bus rather than the size of the instruction, transaction overhead so multiple widths of the bus is even more ideal that one width, etc.
Causing tight loops like this on any core to have a jerky motion and or be inconsistent in timing when the same machine code is used at different alignments. Now granted for size/power/etc the m0+ has a tiny pipe, but it still should show the affects of this. These are not pics or avrs or msp430s no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit banging where you need to be greater than or equal to some time value, but if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort and you have this maintenance issue now or readability or understandability of the code.
So bottom line there is no reason to expect consistent timing. If you happened to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and re-built the code to use a different value for the loop that it didn't affect alignment of this loop.
You show a loop like this
loop:
subs r0,#1
cmp r0,#0
bne loop
on a tangent why the cmp, why not just
loop:
subs r0,#1
bne loop
But second you then claim to be measuring this on a scope, which is good because how you measure things plays into the quality of the benchmark often the benchmark varies because of how it is measured the yardstick is the problem not the thing being measured, or you have problems with both then the measurement is much more inconsistent. Had you used systick or other to measure this depending on how you did that the measurement itself can cause the variation, and even if you used gpio to toggle a pin that can and probably is affecting this as well. All other things held constant simply changing the loop count depending on the immediate and the value used could push you between a thumb and thumb2 instruction changing the alignment of some loop.
What you have shown implies you have this timing loop which can be affected by a number of system issues, then you have wrapped that with some other loop itself being affected, plus possibly a call to a gpio library function which can be affected by these factors as well from a performance perspective. Using inline assembly and the style in which you wrote this function that you posted implies you have exposed yourself and can easily see a wide range of performance differences in running what appears to be the same code, or even actually the code under test being the same machine code.
Unless this is a microchip PIC, not PIC32, or a very very short list of other specific brand and family of chips. Ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware, if for example you are trying to use the ws8212/neopixel leds and you have a tight window for timing you are not going to be successful or will have limited success using instruction timing. In that specific case you can sometimes get away with using a spi controller or timers in the part to generate accurately timed (far more than you can ever do with software timers managing the bit banging or otherwise). With a PIC I was able to generate tv infrared signals with the carrier frequency and ons and off using timed loops and nops to make a highly accurate signal. I repeated that for one of these programmable led things for a short number of them on a cortex-m using a long linear list of instructions and relying on execution performance it worked but was extremely limited as it was compile time and quick and dirty. SPI controllers are a pain compared to bit banging but another evening with the SPI controller and could send any length of highly accurately timed signals out.
You need to change your focus to using timers and/or on chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal this is you are trying to generate. Leave timed loops or even timer based loops wrapped by other loops for the greater than or equal cases and not for the within a range of time cases. If unable to do it with one chip, then look around at others, very often when making a product you have to shop for the components, across vendors, etc. Push comes to shove use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals, they do i2c, spi, jtag, swd etc with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip and I have a lot of different chips/boards handy but only a small fraction of what is out there so if I wanted to do a demo it might not be worth it, if mine has the core compiled one way I might not be able to get it to demonstrate a variation where the same exact core from arm compiled another way on another chip might be easy. I suspect first off a lot of your variation is because you are making calls within a bigger loop, call to delay call to state change the gpio, and you are recompiling that for the experiments. Or worse as shown in your question if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

How Can I save some data before hardware reset of microcontroller?

I'm working on one of Freesacle micro controller. This microcontroller has several reset sources (e.g. clock monitor reset, watchdog reset and ...).
Suppose that because of watchdog, my micro controller is reset. How can I save some data just before reset happens. I mean for example how can I understand that where had been the program counter just before watchdog reset. With this method I want to know where I have error (in another words long process) that causes watchdog reset.
Most Freescale MCUs work like this:
RAM is preserved after watchdog reset. But probably not after LVD reset and certainly not after power-on reset. This is in most cases completely undocumented.
The MCU will either have a status register where you can check the reset cause (for example HCS08, MPC5x, Kinetis), or it will have special reset vectors for different reset causes (for example HC11, HCS12, Coldfire).
There is no way to save anything upon reset. Reset happens and only afterwards can you find out what caused the reset.
It is however possible to reserve a chunk of RAM as a special segment. Upon power-on reset, you can initialize this segment by setting everything to zero. If you get a watchdog reset, you can assume that this RAM segment is still valid and intact. So you don't initialize it, but leave it as it is. This method enables you to save variable values across reset. Probably - this is not well documented for most MCU families. I have used this trick at least on HCS08, HCS12 and MPC56.
As for the program counter, you are out of luck. It is reset with no means to recover it. Meaning that the only way to find out where a watchdog reset occurred is the tedious old school way of moving a breakpoint bit by bit down your code, run the program and check if it reached the breakpoint.
Though in case of modern MCUs like MPC56 or Cortex M, you simply check the trace buffer and see what code that caused the reset. Not only do you get the PC, you get to see the C source code. But you might need a professional, Eclipse-free tool chain to do this.
Depending on your microcontroller you may get Reset Reason, but getting previous program counter (PC/IP) after reset is not possible.
Most of modern microcontrollers have provision for Watchdog Interrupt Instead of reset.
You can configure watchdog peripheral to enable interrupt , In that ISR you can check stored context on stack. ( You can take help from JTAG debugger to check call stack).
There are multiple debugging methods available if your micro-controller dosent support above method.
e.g
In simple while(1) based architecture you can use a HW timer and restart it after some section of code. In Timer ISR you will know which code section is consuming long enough than the timer.
Two things:
Write a log! And rotate that log to keep the last 30 min. or whatever reasonable amount of time you think you need to reproduce the error. Where the log stops, you can see what happened just before that. Even in production-level devices there is some level of logging.
(Less, practical) You can attach a debugger to nearly every micrcontroller and step through the code. Probably put a break-point that is hit just before you enter the critical section of code. Some IDEs/uCs allow having "data-breakpoints" that get triggered when certain variables contain certain values.
Disclaimer: I am not familiar with the exact microcontroller that you are using.
It is written in your manual.
I don't know that specific processor but in most microprocessors a watchdog reset is a soft reset, meaning that certain registers will keep information about the reset source and sometimes reason.
You need to post more specific information on your Freescale μC for this be answered properly.
Even if you could get the Program Counter before reset, it wouldn't be advisable to blindly set the program counter to another after reset --- as there would likely have been stack and heap information as well as the data itself may also have changed.
It depends on what you want to preserve after reset, certain behaviour or data? Volatile memory may or may not have been cleared after watchdog (see your uC datasheet) and you will be able to detect a reset after checking reset registers (again see your uC datasheet). By detecting a reset and checking volatile memory you may be able to prepare your uC to restart in a way that you'd prefer after the unlikely event of a reset occurring. You could create a global value and set it to a particular value in global scope, then if it resets, check the value against it when a reset event occurs -- if it is the same, you could assume other memory may also be the same. If volatile memory is not an option you'll need to have a look at the datasheet for non-volatile options, however it is also advisable not to continually write to non-volatile memory due to writing limitations.
The only reliable solution is to use a debugger with trace capability if your chip supports embedded instruction trace.
Some devices have an option to redirect the watchdog timeout to an interrupt rather then a reset. This would allow you to write the watchdog timeout handler much like an exception handler and dump or store the stack information including the return address which will indicate the location the interrupt occurred.
However in some cases, neither solution is a reliable method of achieving your aim. In a multi-tasking environment or system with interrupt handlers, the code running when the watchdog timeout occurs may not be the process that is causing the problem.

How do you know when a micro-controller reset?

I am learning embedded systems on the ARM9 processor (SAM9G20). I am more familiar with procedural programming for general purpose. Thus what I am doing is going through the data sheet and learning what registers there are and how to manipulate them.
My question is, how do I know when the computer reset? I know that there is a Reset Controller that manages resets. A register called the Status Register (RSTC_SR) stores the source of the reset. Do I need to keep periodically reading this register?
My solution is to store the number of resets in the FRAM (or start by setting it to 0), once a reset happens, I compare this variable with the register value in my main function. If the register value is higher then obviously it reset. However I am sure there is a more optimized way (perhaps using interrupts). Or is this how its usually done?
You do not need to periodically check, since every time the machine is reset your program will re-start from the beginning.
Simply add checks to the startup code, i.e. early in main(), as needed. If you want to figure out things like how often you reset, then that is more difficult since typically (no experience with SAMs, I'm an STM32 type of guy) on-board timers etc will also reset. Best would be some kind of real-world independent clock, like an RTC that you can poll and save the value of. Please consider if you really need this, though.
A simple solution is to exploit the structure of your code.
Many code bases for embedded take this form:
int main(void)
{
// setup stuff here
while (1)
{
// handle stuff here
}
return 0;
}
You can exploit that the code above while(1) is only run once at startup. You could increment a counter there, and save it in non-volatile storage. That would tell you how many times the microcontroller has reset.
Another example is on Arduino, where the code is structured such that a function called setup() is called once, and a function called loop() is called continuously. With this structure, you could increment the variable in the setup()-function to achieve the same effect.
Whenever your processor starts up, it has by definition come out of reset. What the reset status register does is indicate the source or reason for the reset, such as power-on, watchdog-timer, brown-out, software-instruction, reset-pin etc.
It is not a matter of knowing when your processor has reset - that is implicit by the fact that your code has restarted. It is rather a matter of knowing the cause of the reset.
You need not monitor or read the reset status at all if your application has no need of it, but in some applications perhaps it is a useful diagnostic for example to maintain a count of various reset causes as it may be indicative of the stability of your system software, its power-supply or the behaviour of the operators. Ideally you'd want to log the cause with a timestamp assuming you have an suitable RTC source early enough in your start-up. The timing of resets is often a useful diagnostic where simply counting them may not be.
Any counting of the reset cause should occur early in your code start-up before any interrupts are enabled (because an interrupt may itself cause a reset). This may require you to implement the counters in the start-up code before main() is invoked in cases where the start-up code might enable interrupts - for stdio or filesystem support fro example.
A way to do this is to run the code in debug mode (if you got a debugger for the SAM). After a reset the program counter(PC) points to the address where your code starts.

Repeat an instruction a certain number of times without a "loop"?

I've used PICs before and now I'm using with STM32F415.
On a time-critical part of my code I need to put a very exact delay to adjust the period of the DAC-DMA that are working together to create a periodic analog signal.
The delay I want to add goes from 0 to 63 clock cycles (If I were able to do 10-63 clock cycles it would be OK aswell). In PIC24F assembly, there's the instruction "REPEAT" which allows me to repeat the next instruction a certain number of times. That would work great for me as I'd be able to just do:
REPEAT #0xNUMBER
NOP
I'm trying to find something similar with the STM32F4, but I had no luck searching in the instruction set, Reference Manual, and on the Internet in general.
I've already tried to use for/while loops in C and a timer dedicated to it, but the extra instructiosn required consume too much time (40-50 cycles depending on the way I program it).
If someone has an idea or knows how to do it, It would be very useful for me.
Thanks a lot.
English is not my mother-tongue language so I'm sorry for any possible mistakes. Let me know and I'll try to improve it :)
EDIT 1 (23-jul-17)
Thanks to everyone answering, I've been very busy and couldn't answer every one of you individually.
I'll try #berendi solution of gated clocks, it seems as the best fit for my application.
I'm learning a lot of things about the STM32 I didn't know, thank you everyone!
I stop the timer that makes de DMA-DAC work, do the delay and then
enable it again.
So, if I'm understanding it correctly, you have Timer A controlling your DAC, triggering a conversion at each counter overflow, and you'd like to delay it for a variable number of clock cycles.
Most (if not all) timers of the STM32F4 support gated mode slave operation, where you can select another timer (Timer B) as a master, and Timer A counts only as long as the trigger output of Timer B is low. In other words, Timer A will stop counting on a rising edge from Timer B, and resume counting on a falling edge. Now, configure Timer B to output a single pulse when enabled, where the pulse width is the delay you want, then Timer A will be delayed for the exact duration of the pulse.
See the chapters on One-pulse mode, Timers and external trigger synchronization, and the description of the CR1, CR2, and SMCR registers in the reference manual.
NOP is not a very good solution for the delay. Use barrier instructions instead as the execution time is exactly as stated in the ARM documentation (3, 4 or 5 cycles depending what instruction and the core version). You can place n consecutive barriers to archive the delay you need
On a PIC you can do this and it is a very common solution, the execution time was deterministic. Outside architectures like that, and older chips that were also deterministic (before clones came out) that would be okay as well. But in general that is not how you do a delay, it is not deterministic, you can get an "at least this long" for a tuned loop, but you cant get "exactly this long" even tuned, or should never expect to. That is why there are timers, multiple usually, in mcu designs and that is what you use for measuring time. For the problem you are trying to solve that is the solution here, one timer or cascaded timers if you really need that.
Arm does not have an x86 like repeat instruction your smallest loop is going to be two instructions and I have countless times demonstrated that on the same chip this loop can vary in speed, so tune it, add a line of code and the delay properties of this loop change
here:
sub r0,#1
bne here
for classic (gas) syntax, for unified syntax use subs instead of sub.
You are also on an stm32 where they have a buried cache on the instruction side that you cannot turn off nor control, it generally gives you no wait state performance, certainly for things like this but obviously they dont have a cache the size of the flash so pre-fetch cycles have to happen somewhere, and you have to expect that sometimes you are going to have to feel that prefetch when you jump into this loop.

Resources