Performance benefit when using DMA for PWM

I have a segment of code below as a FreeRTOS task running on an STM32F411RE microcontroller:
static void TaskADCPWM(void *argument)
{
    /* Variables used by FreeRTOS to set delays of 50ms periodically */
    const TickType_t DelayFrequency = pdMS_TO_TICKS(50);
    TickType_t LastActiveTime;

    /* Update the variable RawAdcValue through DMA */
    HAL_ADC_Start_DMA(&hadc1, (uint32_t*)&RawAdcValue, 1);

#if PWM_DMA_ON
    /* Initialize PWM CHANNEL2 with DMA, to automatically change TIMx->CCR by updating a variable */
    HAL_TIM_PWM_Start_DMA(&htim3, TIM_CHANNEL_2, (uint32_t*)&RawPWMThresh, 1);
#else
    /* If DMA is not used, user must update TIMx->CCRy manually to alter duty cycle */
    HAL_TIM_PWM_Start(&htim3, TIM_CHANNEL_2);
#endif

    while(1)
    {
        /* Record last wakeup time and use it to perform blocking delay the next 50ms */
        LastActiveTime = xTaskGetTickCount();
        vTaskDelayUntil(&LastActiveTime, DelayFrequency);

        /* Perform scaling conversion based on ADC input, and feed value into PWM CCR register */
#if PWM_DMA_ON
        RawPWMThresh = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#else
        TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE);
#endif
    }
}
The task above uses RawAdcValue to update the TIM3->CCR2 register, either through DMA or manually. RawAdcValue gets updated periodically through DMA, and the value stored in this variable is 12 bits wide.
I understand how using DMA could benefit reading the ADC samples above, as the CPU will not need to poll/wait for the ADC samples, or how DMA helps when transferring long streams of data over I2C or SPI. But is there a significant performance advantage to using DMA to update the TIM3->CCR2 register instead of manually modifying the TIM3->CCR2 register through:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
What would be the main differences between updating the CCR register through DMA or non-DMA?

Let's start by assuming you need to achieve "N samples per second". E.g. for audio this might be 44100 samples per second.
For PWM, you need to change the state of the output multiple times per sample. For example; for audio this might mean writing to the CCR around four times per sample, or "4*44100 = 176400" times per second.
Now look at what vTaskDelayUntil() does - most likely it sets up a timer and does a task switch, then (when the timer expires) you get an IRQ followed by a second task switch. It might add up to a total overhead of 500 CPU cycles each time you change the CCR. You can convert this into a percentage. E.g. (continuing the audio example), "176400 CCR updates per second * 500 cycles per update = about 88.2 million cycles per second of overhead"; then, for a 100 MHz CPU, you can do "88.2 million / 100 million = 88.2% of all CPU time wasted because you didn't use DMA".
The next step is to figure out where the CPU time comes from. There's 2 possibilities:
a) If your task is the highest priority task in the system (including being higher priority than all IRQs, etc.), then every other task becomes a victim of your time consumption. In this case you've single-handedly ruined any point of bothering with a real-time OS (it would probably be better to use a faster/more efficient non-real-time OS that optimizes the "average case" instead of the "worst case", use DMA, and use a less powerful/cheaper CPU, to get a much better end result at a reduced "cost in $").
b) If your task isn't the highest priority task in the system, then the code shown above is broken. Specifically, an IRQ (and possibly a task switch/preemption) can occur immediately after the vTaskDelayUntil(&LastActiveTime, DelayFrequency);, causing the TIM3->CCR2 = (uint16_t)((RawAdcValue * MAX_TIM3_PWM_VALUE)/MAX_ADC_12BIT_VALUE); to occur at the wrong time (much later than intended). In pathological cases (e.g. where some other event like disk or network just happens to occur at a related frequency - e.g. at half your "CCR update frequency") this can easily become completely unusable (e.g. because turning the output on is often delayed more than intended while turning the output off is not).
However...
All of this depends on how many samples per second (or better, how many CCR updates per second) you actually need. For some purposes (e.g. controlling an electric motor's speed in a system that changes the angle of a solar panel to track the position of the sun throughout the day); maybe you only need 1 sample per minute and all the problems caused by using CPU disappear. For other purposes (e.g. AM radio transmissions) DMA probably won't be good enough either.
WARNING
Unfortunately, I can't/didn't find any documentation for HAL_ADC_Start_DMA(), HAL_TIM_PWM_Start() or HAL_TIM_PWM_Start_DMA() online, and don't know what the parameters are or how the DMA is actually being used. When I first wrote this answer I simply relied on a "likely assumption" that may have been a false assumption.
Typically, for DMA you have a block of many pieces of data (e.g. for audio, maybe you have a block 176400 values - enough for a whole second of sound at "4 values per sample, 44100 samples per second"); and while that transfer is happening the CPU is free to do other work (and not wasted). For continuous operation, the CPU might prepare the next block of data while the DMA transfer is happening, and when the DMA transfer completes the hardware would generate an IRQ and the IRQ handler will start the next DMA transfer for the next block of values (alternatively, the DMA channel could be configured for "auto-repeat" and the block of data might be a circular buffer). In that way, the "88.2% of all CPU time wasted because you didn't use DMA" would be "almost zero CPU time used because DMA controller is doing almost everything"; and the whole thing would be immune to most timing problems (an IRQ or higher priority task preempting can not influence the DMA controller's timing).
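To make that pattern concrete, here is a minimal sketch of the "block of values with auto-repeat" approach, assuming an STM32 HAL project in which the TIM3 channel 2 DMA request is configured in circular mode with half-word memory width (e.g. via CubeMX). PwmBuffer, PWM_BUFFER_LEN and StartPwmWaveform are hypothetical names; MAX_TIM3_PWM_VALUE is the constant from the question.

/* Block of pre-computed duty-cycle values; with the DMA stream in circular
 * mode, each timer update event pulls the next value into TIM3->CCR2 with
 * no CPU involvement. */
#define PWM_BUFFER_LEN 256
static uint16_t PwmBuffer[PWM_BUFFER_LEN];

void StartPwmWaveform(void)
{
    uint32_t i;

    /* Fill the buffer with an initial waveform (here a simple ramp) */
    for (i = 0; i < PWM_BUFFER_LEN; i++)
    {
        PwmBuffer[i] = (uint16_t)((i * MAX_TIM3_PWM_VALUE) / PWM_BUFFER_LEN);
    }

    /* One DMA request per update event copies the next entry into the CCR;
     * in circular mode the transfer restarts automatically at the end. */
    HAL_TIM_PWM_Start_DMA(&htim3, TIM_CHANNEL_2,
                          (uint32_t *)PwmBuffer, PWM_BUFFER_LEN);
}

The CPU then only touches PwmBuffer when the waveform needs to change; the per-update work is done entirely by the DMA controller.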
This is what I assumed the code is doing when it uses DMA. Specifically, I assumed that every "N nanoseconds" the DMA would take the next raw value from a large block of raw values and use that next raw value (representing the width of the pulse) to set a timer's threshold to a value from 0 to N nanoseconds.
In hindsight; it's possibly more likely that the code sets up the DMA transfer for "1 value per transfer, with continual auto-repeat". In that case the DMA controller would be continually pumping whatever value happens to be in RawPWMThresh to the timer at a (possibly high) frequency, and then the code in the while(1) loop would be changing the value in RawPWMThresh at a (possibly much lower) frequency. For example (continuing the audio example); it could be like doing "16 values per sample (via the DMA controller), with 44100 samples per second (via the while(1) loop)". In that case; if something (an unrelated IRQ, etc.) causes an unexpected extra delay after the vTaskDelayUntil(); then it's not a huge catastrophe (the DMA controller simply repeats the existing value for a little longer).
If that is the case; then the real difference could be "X values per sample with 20 samples per second" (with DMA) vs. "1 value per sample with 20 samples per second" (without DMA); where the overhead is the same regardless, but the quality of the output is much better with DMA.
However; without knowing what the code actually does (e.g. without knowing the frequency of the DMA channel and how things like the timer's prescaler are configured) it's also technically possible that when using DMA the "X values per sample with 20 samples per second" is actually "1 value per sample with 20 samples per second" (with X == 1). In that case, using DMA would be almost pointless (none of the performance benefits I originally assumed; and almost none of the "output quality" benefits I'm tempted to assume in hindsight, except for the "repeat old value if there's unexpected extra delay after the vTaskDelayUntil()").

First, remember that premature optimization is the cause of uncountably many problems. The question you need to ask is "what ELSE does the processor need to do?". If the processor has nothing better to do, then just poll and save yourself some programming effort.
If the processor does have something better to do (or you are running from batteries and want to save power) then you need to time how long the processor spends waiting between each thing that it needs to do.
In your case, you are using an operating system context switch in place of "waiting". You can time the cost of the switch-write-to-pwm-switch-back cycle by measuring the performance of some other thread.
Set up a system with two threads. Perform some task that you know the performance of in one thread, e.g. some fixed computation or processor benchmark. Now set up the other thread to do your timer business above. Measure the performance of the first thread.
Next, set up a similar system with only the first thread plus DMA doing the PWM. Measure the performance change, and you have your answer.
Obviously this all depends very much on your exact system. There is no general answer that can be given. The closer your test is to your real system the more accurate the answer you will get.
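As a minimal sketch of that measurement (assuming FreeRTOS; BenchmarkCount, TaskBenchmark, the stack sizes and the priorities are placeholder choices, and TaskADCPWM is the task from the question):

/* Lowest-priority task that does nothing but count; the count it reaches per
 * second is a proxy for how much CPU time is left for "everything else". */
static volatile uint32_t BenchmarkCount;

static void TaskBenchmark(void *argument)
{
    (void)argument;
    for (;;)
    {
        BenchmarkCount++;   /* read and reset this once per second elsewhere */
    }
}

void StartBenchmark(void)
{
    /* Benchmark task at the lowest priority, the ADC/PWM task above it.
     * Compare BenchmarkCount per second with PWM_DMA_ON set to 1 and to 0. */
    xTaskCreate(TaskBenchmark, "bench",  128, NULL, tskIDLE_PRIORITY + 1, NULL);
    xTaskCreate(TaskADCPWM,    "adcpwm", 256, NULL, tskIDLE_PRIORITY + 2, NULL);
}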
PS: Your PWM will glitch using the above code, because the first write momentarily sets CCR2 to zero and the timer can act on that value before the second write lands. Replace the two writes with a single one:
TIM3->CCR2 &= ~0xFFFF;
TIM3->CCR2 |= SomeValue;
should be:
TIM3->CCR2 = ((TIM3->CCR2 & ~0xFFFF) | SomeValue);

Related

Calculating pulse width (duty cycle) using stm32 DMA. Is it possible?

I'm working on a project that a series of duty cycles must be measured. A sample of related waveforms is displayed below:
As you can see from the signal, the frequency is too high, and calculating it using bit functions is impossible. On the controller's tech website here, they used the timer's input capture mode and rising/falling edge interrupts to calculate the difference between two captures of the timer. But this method is too slow and cannot keep up with high-frequency signals.
The other solution is to use DMA to transfer the capture data to memory quickly. But in STM32CubeMX it is not possible to assign two DMA requests to the two captures of one timer, as you can see below:
Could someone give me a suggestion for this issue?
Using the DMA channel is not likely to provide a good solution for fast signals, since the memory bus is shared between the DMA controller and the CPU, so predictable timing of the capture events is not guaranteed. Also, the timing relationship between the DMA transfers and the external signal will be difficult to resolve. So I'd say "no" to your question.
With 16-bit timers that can run at up to 120 MHz, the timers featured on the STM32 are your best choice. An 800 kHz signal is not too fast for these critters! The trick is how you make use of the timers. You want to use input capture mode: capture several samples of the logic-high times and average them, do the same for the logic-low times, then add the two averages for the total timer ticks, and multiply that by the timer tick period to get the external signal's period.
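For illustration, a minimal sketch of that averaging step in plain C (NUM_SAMPLES, TIMER_TICK_NS and the capture arrays are hypothetical names; the arrays are assumed to be filled by your input-capture ISR or DMA):

#include <stdint.h>

/* Hypothetical capture results in timer ticks, filled elsewhere */
#define NUM_SAMPLES   8
#define TIMER_TICK_NS 10u                 /* e.g. a 100 MHz timer clock */

uint32_t highTicks[NUM_SAMPLES];          /* durations of the logic-high phase */
uint32_t lowTicks[NUM_SAMPLES];           /* durations of the logic-low phase */

/* Average the high and low times, then derive period (ns) and duty cycle (%) */
void measureSignal(uint32_t *periodNs, uint32_t *dutyPercent)
{
    uint64_t highSum = 0, lowSum = 0;
    uint32_t i;

    for (i = 0; i < NUM_SAMPLES; i++)
    {
        highSum += highTicks[i];
        lowSum  += lowTicks[i];
    }

    uint32_t avgHigh = (uint32_t)(highSum / NUM_SAMPLES);
    uint32_t avgLow  = (uint32_t)(lowSum  / NUM_SAMPLES);
    uint32_t total   = avgHigh + avgLow;

    *periodNs    = total * TIMER_TICK_NS;
    *dutyPercent = (total != 0u) ? (avgHigh * 100u) / total : 0u;
}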

How to determine MCU Clock speed requirements

Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it is decided case by case. For example, your requirement for a certain speed may be affected by the on-chip peripherals offered by the MCU. As an example, you may spend (n) cycles servicing an ISR for an encoder, or you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn may loosen your requirement.
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term, or completely misinterpret something. I assure you; the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload, consisting of 32 bytes. The length is fixed at 32 bytes, payload is dynamic, baud is 115200bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled with 4 DC Motors, 1 encoder for each motor (See Task #1 Below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals at 50 Hz (20 ms) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 8 digital signals (4 quadrature encoders, 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of 1000-3200 µs, per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say, this chip meets my requirements because it has (n) UARTS, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed based on the needs of a given application. Or, to put it another way (which is probably wrong): how do you quantify the theoretical load on the processor for that specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors has a different task within the system, but regardless, they are the same brand/model motor and have a maximum RPM of 230. My assumption is, in the worst case, one of the motors is spinning at 230 RPM, which means that at full quadrature resolution (counting rising/falling edges on channels A and B) the 1000 PPR encoder would generate 4K interrupts per revolution. As such, the MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) clock cycles are required to service the ISR, and for 1 revolution of 1 motor we expect 4K interrupts, that would be… 230 (RPM) * 4K (interrupts per rev) == 920,000 interrupts per minute? Yikes! And then I guess you could just extrapolate and say, again at the worst case, where each of the 4 motors is spinning at 230 RPM, there's a potential that, if the encoders are at full resolution, the system would have to endure 920K interrupts per minute for each encoder. So 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please, feel free to set me straight.
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.
You need to separate the real-time critical parts from the rest of the application. For example, the actual reception of a UART frame is somewhat time-critical if you do it interrupt-based. But the protocol decoding is not critical at all unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
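A minimal sketch of the DMA approach, using the STM32 HAL purely as an example (huart1, RxPayload and StartPayloadReception are placeholder names, and the RX DMA stream is assumed to be configured in normal mode):

#define PAYLOAD_LEN 32
static uint8_t RxPayload[PAYLOAD_LEN];

/* Arm the DMA once; the controller fills RxPayload with no per-byte interrupts */
void StartPayloadReception(void)
{
    HAL_UART_Receive_DMA(&huart1, RxPayload, PAYLOAD_LEN);
}

/* Called by the HAL from the DMA transfer-complete interrupt, i.e. once per
 * 32-byte frame (roughly every 7 ms) instead of once per byte. */
void HAL_UART_RxCpltCallback(UART_HandleTypeDef *huart)
{
    if (huart == &huart1)
    {
        /* Hand the frame to the (non-time-critical) protocol decoder, e.g.
         * by pushing it to a queue, then re-arm the transfer. A real
         * implementation also needs some frame-synchronisation scheme. */
        HAL_UART_Receive_DMA(&huart1, RxPayload, PAYLOAD_LEN);
    }
}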
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. If you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute is about 61,000 interrupts per second, i.e. one interrupt roughly every 16 µs across the four encoders. That is too demanding for a shabby 8-bitter running at 8 MHz, but it is comfortable for a 32-bit MCU running at tens of MHz, and a timer with a hardware quadrature/encoder (QEI) interface removes the interrupt load entirely.
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
the MCU will output 4 separate PWM signals
Not a problem, just pick one with a decent PWM hardware peripheral. You should just need to update some PWM duty cycle registers now and then.
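As a worked example of how cheap this is for the CPU, here is a sketch using STM32-style timer registers purely as an illustration (the question doesn't name a part); the 84 MHz timer clock is an assumption, only one channel is shown, and the PWM mode/output-enable setup is omitted (normally done by your init code):

/* 84 MHz timer clock / 84 = 1 MHz counter clock, i.e. 1 us per tick.
 * ARR = 20000 - 1 gives a 20 ms period (50 Hz).
 * CCR values of 1000..3200 then give 1000..3200 us pulse widths. */
void ConfigServoStylePwm(void)
{
    TIM3->PSC  = 84 - 1;        /* prescale to a 1 us tick        */
    TIM3->ARR  = 20000 - 1;     /* 20 ms period = 50 Hz           */
    TIM3->CCR1 = 1500;          /* e.g. a neutral 1500 us pulse   */
}

/* After that, the only per-update work is one register write per motor */
void SetMotorPulse(uint16_t pulse_us)   /* 1000..3200 */
{
    TIM3->CCR1 = pulse_us;
}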
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8 and 16 bitters. However, each time I did, I regretted not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex M0+ would be a good candidate for this project. Clock it at ~48MHz and you'll have plenty of CPU power. Cortex M4 or larger if you actually expect floating point math (I don't quite see why you'd need that though).
Given the current component crisis, be careful with which brand you pick though! In particular stay clear of STM32, since ST can't produce them right now and you might end up waiting over a year until you get parts.
The answer to the question is "experience". But intuitively your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200 MHz C5502 DSP at near 100% CPU load. The application now runs on a 72 MHz Cortex-M3 at only 60%, with additional functionality and I/O not present in the original implementation.
Your application is I/O bound; depending on data rates (and critically interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input capture timer/counters, hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; @Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get more deterministic and possibly faster response from an 8-bit AVR running at a few MHz than you can guarantee from a 500 MHz application processor running Linux - and it won't take 30 seconds to boot!

Unexpected timing in Arm assembly

I need very precise timing, so I wrote some assembly code (for an ARM M0+).
However, the timing is not what I expected when measuring it on an oscilloscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
    // note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
    __asm volatile(
        ".syntax unified \t\n"  /* we need unified to use subs (not for sub, though) */
        "0: \t\n"
        "subs %[cyc], #1 \t\n"  /* assume cycles > 0 */
        "cmp %[cyc], #0 \t\n"
        "bne.n 0b\t\n"          /* this instruction costs 2 cycles when branching! */
        : [cyc]"+r" (loopCnt)   /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
        :                       /* input specified in output */
        :                       /* no clobbers */
    );
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting it high again) is consistently 50% longer than expected. I tried low values (1 us giving 1.56 us) up to 500 ms giving 750 ms.
I tried to single-step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). Parentheses show the expected number of clock cycles.
Can anybody shed light on what is going on here?
After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since the subs has the s flag set. If doing so, remember to set instruction count to 3 instead of 4. This will give you a better time resolution.
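For reference, a sketch of that reduced loop, replacing the original LOOP_INSTRS_CNT and function body (the "cc" clobber is added because subs updates the condition flags; cycle counts still vary with flash wait-states and alignment, as discussed below):

#define LOOP_INSTRS_CNT 3 // subs: 1, bne: 2 (when branching)

static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
    __asm volatile(
        ".syntax unified   \n"
        "0:                \n"
        "subs %[cyc], #1   \n"  /* assume loopCnt > 0; also updates the flags */
        "bne.n 0b          \n"  /* 2 cycles when the branch is taken */
        : [cyc] "+r" (loopCnt)
        :
        : "cc"                  /* condition flags are modified by subs */
    );
}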
You can't use these processors like you would a PIC; the timing doesn't work like that. I have demonstrated this here many times, you can look around; maybe I will do it again here, but not right now.
First off, these are pipelined, so your average performance is one thing; but once you are in a loop and things like caching, branch prediction learning and other factors have settled, then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor, no matter how shallow the pipe; that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction, which doesn't show up in the arm docs as something arm does, but the chip vendors are fully free to add.
Caching. Even on a cortex-m0+, if this is an STM32 (or perhaps other brands as well) there is or may be a cache you can't turn off. It is not uncommon for the flash to be half the speed of the processor, hence the flash wait-state settings; but often "zero wait states" means zero additional ones, and it still takes two clocks to get one fetch done, or at least it is measurable that execution from flash is half the speed of execution from RAM with all other settings the same (system clock speed, etc.). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent, who knows. And rarely can you turn this off or defeat it, so the first time through, or the time spent entering the loop, can see a delay, and technically a prefetcher can slow down the loop (see alignment).
Flash. As mentioned, depending on the chip vendor and the age of the part, it is quite common for the flash to be half the speed of the core. The table in the chip documentation showing the required wait states relative to system clock speed is a key performance indicator, both for the flash technology and for whether you should really be raising the system clock that high: the flash doesn't get any faster, it has a speed limit. SRAM, from my experience, can keep up, and so far I don't see it needing wait states. Flashes used to need two or three wait-state settings across the range of clock speeds a part supports; on newer parts the flash tends to cover the whole range for slower cores like the m0+, but the m7 and such keep getting higher clock rates, so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an RTOS? Are there interrupts going on that can interrupt this loop and add delay, and are you accounting for or guarding against that?
Peripheral timing. The peripherals are not expected to respond to a load or store in a single clock; they can take as long as they want, and depending on the clocking system and the chip vendor's IP (in-house or purchased) the peripheral might not run at the processor clock rate, but at a divided rate, making things slower. Your code is no doubt calling this function for a delay, and then outside this timing loop you are wiggling a GPIO pin to see something on a scope, which leads to how you conducted your benchmark and additional problems with that, based on the factors above and this one.
And other factors I have to remember.
As with high-end processors like the x86, full-sized ARMs, etc., the processor core no longer determines performance on its own; the chip and motherboard can/do. You basically cannot feed the pipe constantly; there are stalls going on all over the place. DRAM is slow, thus layers of caching try to deal with it, but caching helps sometimes and hurts other times, and branch predictors hurt as much as they help. And so on, but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy. Ideally you use the width of the bus rather than the size of the instruction, and because of transaction overhead, multiple widths of the bus per transfer is even more ideal than one width, etc.
All of this causes tight loops like this, on any core, to be jerky and/or inconsistent in timing when the same machine code is used at different alignments. Now granted, for size/power/etc. the m0+ has a tiny pipe, but it should still show the effects of this. These are not PICs or AVRs or MSP430s; there is no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit-banging where you need to be greater than or equal to some time value; but if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort and you now have a maintenance, readability and understandability issue in the code.
So, bottom line, there is no reason to expect consistent timing. If you happened to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and re-built the code to use a different loop value, it didn't affect the alignment of this loop.
You show a loop like this
loop:
    subs r0,#1
    cmp r0,#0
    bne loop
On a tangent: why the cmp? Why not just
loop:
    subs r0,#1
    bne loop
But second, you then claim to be measuring this on a scope, which is good, because how you measure things plays into the quality of the benchmark; often the benchmark varies because of how it is measured - the yardstick is the problem, not the thing being measured - or you have problems with both, and then the measurement is much more inconsistent. Had you used systick or something else to measure this, then depending on how you did that, the measurement itself can cause the variation; and even if you used a GPIO to toggle a pin, that can and probably is affecting this as well. All other things held constant, simply changing the loop count could, depending on the immediate value used, push you between a thumb and a thumb2 instruction, changing the alignment of some loop.
What you have shown implies you have this timing loop, which can be affected by a number of system issues; then you have wrapped that with some other loop, itself affected, plus possibly a call to a GPIO library function which can be affected by these factors as well from a performance perspective. Using inline assembly, and the style in which you wrote the function you posted, implies you have exposed yourself to, and can easily see, a wide range of performance differences when running what appears to be the same code, or even the very same machine code under test.
Unless this is a Microchip PIC (not PIC32), or one of a very short list of other specific brands and families of chips: ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware. If, for example, you are trying to drive ws2812/neopixel LEDs and you have a tight window for timing, you are not going to be successful, or will have limited success, using instruction timing. In that specific case you can sometimes get away with using a SPI controller or timers in the part to generate accurately timed signals (far more accurately than you can ever do with software timers managing the bit-banging or otherwise). With a PIC I was able to generate TV infrared signals, with the carrier frequency and the ons and offs, using timed loops and nops to make a highly accurate signal. I repeated that for one of these programmable LED things, for a short string of them, on a cortex-m using a long linear list of instructions and relying on execution performance; it worked, but was extremely limited as it was compile-time and quick and dirty. SPI controllers are a pain compared to bit-banging, but after another evening with the SPI controller I could send any length of highly accurately timed signals out.
You need to change your focus to using timers and/or on-chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal it is you are trying to generate. Leave timed loops, or even timer-based loops wrapped by other loops, for the greater-than-or-equal cases and not for the within-a-range-of-time cases. If you are unable to do it with one chip, then look around at others; very often when making a product you have to shop for the components, across vendors, etc. If push comes to shove, use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like, the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals; they do i2c, spi, jtag, swd etc. with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip, and I have a lot of different chips/boards handy, but only a small fraction of what is out there, so if I wanted to do a demo it might not be worth it; if mine has the core compiled one way, I might not be able to get it to demonstrate a variation, where the same exact core from arm compiled another way on another chip might make it easy. I suspect, first off, that a lot of your variation is because you are making calls within a bigger loop - a call to the delay, a call to change the state of the GPIO - and you are recompiling that for the experiments. Or worse, as shown in your question, if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

STM32F407 timers with hall encoders

I'm a bit unsure what's the best approach to the problem given my knowledge of the STM32. I want to measure the speed and position of a motor with an integrated hall encoder of 6400 rising/falling edges per rotation, separated into two channels (one CH gives 3200 rising/falling edges).
What's the best way to do it?
The thing is... I have 4 motors to measure.
I considered many options, but I would like one that only generates interrupts when the position data is already known (basically, so that I don't have to increment a position variable myself at each pulse, but instead let a timer do it for me).
From what I know, a few timers support a mode called "Encoder mode". I don't know the details about this mode, but I would like (if possible) to be able to calculate my speed at a fixed amount of time (say around 20ms).
Is it possible in encoder mode, with one timer, to know both the rising/falling edges count (which I guess would be in the CNT register) and have it trigger an interrupt at each 20 ms, so that I can divide the CNT register by 20ms to get the count/sec speed within the ISR?
The other option I have is to count with Input Capture direct mode with two channels on each timer (one for each motor), and have another timer with a fixed period of 20ms, and calculate all the speeds of the 4 motors there. But it requires 5 timers...
If anything else, is there a way DMA could help to keep it to 4 timers? For example, can we count with DMA?
Thanks!
The encoder interface mode on the STM32F407 is supported on timers 1 & 8 (Advanced Control timers - 16 bit) and timers 2 to 5 (General purpose timers - 16/32 bit). Timers 9 to 14 (also General purpose) do not support quadrature encode input.
It is important that in this mode the timer is operating as a counter rather than a timer. The quadrature input allows up/down count depending on the direction, so that it will provide relative position.
Note that if your motor will only ever travel in one direction, you do not need the encoder mode, you can simply clock a timer from a single channel, although that will reduce the resolution significantly, so accuracy at low speeds may suffer.
To determine speed, you need to calculate change in relative position over time.
All ARM Cortex-M devices have a SYSTICK timer which will generate a periodic interrupt. You can use this to count time.
You then have two possibilities:
read the encoder counter periodically whereby the change in count is directly proportional to speed (because change in time will be a constant),
read the encoder aperiodically and calculate change in position over change in time
The reload value for the encoder interface mode is configurable; for this application (speed rather than position) you should set it to the maximum (0xffff or 0xffffffff), since that makes the arithmetic simpler as you won't have to deal with wrap-around (so long as it does not wrap around twice between reads).
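For reference, a minimal sketch of the timer setup and of the getEncoderCount() helper used by the pseudo-code below, using the STM32 HAL (htim2..htim5 are assumed to be the CubeMX-generated handles for the four encoder timers):

/* Map timer_id 0..3 to the four encoder timers. Note that on the F407 TIM2
   and TIM5 are 32-bit while TIM3 and TIM4 are 16-bit. */
static TIM_HandleTypeDef * const htimEncoder[4] = { &htim2, &htim3, &htim4, &htim5 } ;

void encoderInit( int timer_id )
{
    /* Maximum reload so the wrap-around arithmetic below works
       (effectively 0xFFFF on the 16-bit timers). */
    __HAL_TIM_SET_AUTORELOAD( htimEncoder[timer_id], 0xFFFFFFFFu ) ;
    HAL_TIM_Encoder_Start( htimEncoder[timer_id], TIM_CHANNEL_ALL ) ;
}

uint32_t getEncoderCount( int timer_id )
{
    return __HAL_TIM_GET_COUNTER( htimEncoder[timer_id] ) ;
}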
For the aperiodic method and assuming you are using timers 2 to 5 in 32 bit mode, the following pseudo-code will generate speed in RPM for example:
int speedRPM_Aperiodic( int timer_id )
{
    int rpm = 0 ;

    static struct
    {
        uint32_t count ;
        uint32_t time ;
    } previous[] = {{0,0},{0,0},{0,0},{0,0}} ;

    if( timer_id < sizeof(previous) / sizeof(*previous) )
    {
        uint32_t current_count = getEncoderCount( timer_id ) ;
        uint32_t current_time = getTick() ;

        /* Unsigned subtraction handles a single counter wrap-around correctly */
        int32_t delta_count = (int32_t)(current_count - previous[timer_id].count) ;
        int32_t delta_time  = (int32_t)(current_time - previous[timer_id].time) ;

        previous[timer_id].count = current_count ;
        previous[timer_id].time = current_time ;

        if( delta_time > 0 )
        {
            rpm = (TICKS_PER_MINUTE * delta_count) /
                  (delta_time * COUNTS_PER_REVOLUTION) ;
        }
    }

    return rpm ;
}
The function needs to be called often enough that the count does not wrap-around more than once, and not so fast that the count is too small for accurate measurement.
This can be adapted for a periodic method where delta_time is fixed and very accurate (such as from the timer interrupt or a timer handler):
int speedRPM_Periodic( int timer_id )
{
    int rpm = 0 ;

    static uint32_t previous_count[] = {0,0,0,0} ;

    if( timer_id < sizeof(previous_count) / sizeof(*previous_count) )
    {
        uint32_t current_count = getEncoderCount( timer_id ) ;

        /* Unsigned subtraction handles a single counter wrap-around correctly */
        int32_t delta_count = (int32_t)(current_count - previous_count[timer_id]) ;
        previous_count[timer_id] = current_count ;

        rpm = (TICKS_PER_MINUTE * delta_count) /
              (SPEED_UPDATE_TICKS * COUNTS_PER_REVOLUTION) ;
    }

    return rpm ;
}
This function must then be called exactly every SPEED_UPDATE_TICKS.
The aperiodic method is perhaps simpler to implement, and is good for applications where you want to know the mean speed over the elapsed period. It is suitable, for example, for a human-readable display that might be updated relatively slowly.
The periodic method is better suited to speed control applications where you are using a feed-back loop to control the speed of the motor. You will get poor control if the feedback timing is not constant.
The aperiodic function could of course be called periodically, but has unnecessary overhead where delta time is deterministic.
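As a usage sketch for the periodic method (assuming a 1 ms tick so that TICKS_PER_MINUTE is 60000 and SPEED_UPDATE_TICKS is 20; motorRPM[] and the handler name are hypothetical):

#define SPEED_UPDATE_TICKS 20          /* 20 ms at a 1 ms tick */

volatile int motorRPM[4] ;

/* Call this from your 1 ms tick handler (e.g. the SysTick interrupt) */
void speedUpdateTick( void )
{
    static uint32_t tick = 0 ;

    if( ++tick >= SPEED_UPDATE_TICKS )
    {
        tick = 0 ;
        for( int timer_id = 0; timer_id < 4; timer_id++ )
        {
            motorRPM[timer_id] = speedRPM_Periodic( timer_id ) ;
        }
    }
}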
A timer can count on one type of event
It can count either on some external signal like your sensors, or on a clock signal, but not on both of them at once. If you want to do something in every 20ms, you need something that counts on a stable clock source.
DMA can of course count the transfers it's doing, but to make it do something at every 20 ms, it has to be triggered at fixed time intervals by something.
Therefore, you need a fifth timer.
Fortunately, there are lots of timers to choose from
10 more timers
The F407 has 14 hardware timers. You don't want to use more than 4 of them, so I'm assuming the other 10 are already used elsewhere in your application. Check how they are used: perhaps there is one that counts at a suitable clock frequency and can generate an interrupt at a rate suitable for sampling your encoders.
SysTick timer
Cortex-M cores have an internal timer called SysTick. Many applications use it to generate an interrupt every 1 ms for timekeeping and other periodic tasks. If that's the case, you can read the encoder values on every 20th SysTick interrupt - this has the advantage of not requiring additional interrupt entry/exit overhead. Otherwise you can set it up directly to generate an interrupt every 20 ms. Note that you won't find SysTick in the Reference Manual; it's documented in the STM32F4 Programmer's Manual.
Real-Time clock
The RTC has a periodic auto-wakeup function that can generate an interrupt every 20 ms.
UART
Unless you are using all 6 UARTS, you can set one of them to a really slow baud rate like 1000 baud, and keep transmitting dummy data (you don't have to assign a physical pin to it). Transmitting at 1000 baud, with 8 bit, one start and one stop bit gives an interrupt at every 10 ms. (It won't let you go down to 500 baud unless your APB frequency is lower than the maximum allowed)

Long Delay using Delay Functions from C18 Libraries for PIC18

I'm using a PIC18 with Fosc = 10MHz. So if I use Delay10KTCYx(250), I get 10,000 x 250 x 4 x (1/10e6) = 1 second.
How do I use the delay functions in the C18 for very long delays, say 20 seconds? I was thinking of just using twenty lines of Delay10KTCYx(250). Is there another more efficient and elegant way?
Thanks in advance!
It is strongly recommended that you avoid using the built-in delay functions such as Delay10KTCYx()
Why you might ask?
These delay functions are very inaccurate, and they may cause your code to be compiled in unexpected ways. Here's one such example where using the Delay10KTCYx() function can cause problems.
Let's say that you have a PIC18 microprocessor that has only two hardware timer interrupts. (Usually they have more but let's just say there are only two).
Now let's say you manually set up the first hardware timer interrupt to blink once per second exactly, to drive a heartbeat monitor LED. And let's say you set up the second hardware timer interrupt to interrupt every 50 milliseconds because you want to take some sort of digital or analog reading at exactly 50 milliseconds.
Now, lastly, let's say that in your main program you want to delay 100,000 clock cycles. So you put a call to Delay10KTCYx(10) in your main program. What happens, do you suppose? How does the PIC18 magically count off 100,000 clock cycles?
One of two things will happen. It may "hijack" one of your other hardware timer interrupts to get exactly 100,000 clock cycles. This would either cause your heartbeat sensor to not clock at exactly 1 second, or, cause your digital or analog readings to happen at some time other than every 50 milliseconds.
Or, the delay function will just call a bunch of Nop() and claim that 1 Nop() = 1 clock cycle. What isn't accounted for is the "overhead" within the Delay10KTCYx(10) function itself. It has to increment a counter to keep track of things, and surely it takes more than 1 clock cycle to increment that counter. As the Delay10KTCYx(10) loops around and around, it is just not capable of giving you exactly 100,000 clock cycles. Depending on a lot of factors you may get way more, or way fewer, clock cycles than you expected.
The Delay10KTCYx(10) should only be used if you need an "approximate" amount of time. And pre-canned delay functions shouldn't be used if you are already using the hardware timer interrupts for other purposes. The compiler may not even successfully compile when using Delay10KTCYx(10) for very long delays.
I would highly recommend that you set up one of your timer interrupts to interrupt your hardware at a known interval, say 50,000 instruction cycles. Then, each time the hardware interrupts, within your ISR code for that timer interrupt, increment a counter and reset the timer back to 0 cycles. When enough 50,000-cycle periods have elapsed to equal 20 seconds, reset your counter (in your example, with Fosc = 10 MHz an instruction cycle is 0.4 µs, so 50,000 cycles is 20 ms and 20 seconds takes 1000 timer interrupts). Basically my advice is that you should always manually handle time in a PIC and not rely on pre-canned Delay functions - rather, build your own delay functions that integrate with the hardware timer of the chip. Yes, it's going to be extra work - "but why can't I just use this easy and nifty built-in delay function, why would they even put it there if it's gonna muck up my program?" - but this should become second nature. Just like you should be manually configuring EVERY SINGLE REGISTER in your PIC18 upon boot-up, whether you are using it or not, to prevent unexpected things from happening.
You'll get way more accurate timing - and way more predictable behavior from your PIC18. Using pre-canned Delay functions is a recipe for disaster... it may work... it may work on several projects... but sooner or later your code will go all buggy on you and you'll be left wondering why and I guarantee the culprit will be the pre-canned delay function.
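As a minimal sketch of that timer-driven approach (assuming Fosc = 10 MHz as in the question, so one instruction cycle is 0.4 µs and a 50,000-cycle Timer0 period is 20 ms; register names per a typical PIC18 device header, C18 toolchain):

#include <p18cxxx.h>

static volatile unsigned int gTimerTicks = 0;       /* one tick = 20 ms */

#pragma interrupt Timer0Isr
void Timer0Isr(void)
{
    if (INTCONbits.TMR0IF)
    {
        INTCONbits.TMR0IF = 0;
        TMR0H = 0x3C;                 /* reload 65536 - 50000 = 15536 ...    */
        TMR0L = 0xB0;                 /* ... so the next overflow is in 20 ms */
        gTimerTicks++;
    }
}

#pragma code high_vector = 0x08
void HighVector(void)
{
    _asm GOTO Timer0Isr _endasm
}
#pragma code

void Timer0Init(void)
{
    T0CON = 0x08;                     /* 16-bit mode, Fosc/4, prescaler bypassed */
    TMR0H = 0x3C;
    TMR0L = 0xB0;
    INTCONbits.TMR0IF = 0;
    INTCONbits.TMR0IE = 1;
    INTCONbits.GIE = 1;
    T0CONbits.TMR0ON = 1;
}

void DelaySeconds(unsigned char seconds)
{
    /* 50 ticks of 20 ms per second; 20 s -> 1000 ticks.
       (Reads of the 16-bit counter are not atomic on an 8-bit core; mask
       interrupts around the read if exact counts matter.) */
    unsigned int target = (unsigned int)seconds * 50u;

    gTimerTicks = 0;
    while (gTimerTicks < target)
    {
        /* free to do other (non-blocking) work here instead of busy-waiting */
    }
}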
To create very long delays, use an internal timer. This helps you avoid blocking your application, and you can check the elapsed time while it runs. Please refer to the PIC data sheet on how to set up a timer and its interrupt.
If you want a very high-precision 1 s time base, I suggest also considering an external RTC device, or the internal RTC if the micro has one.
