Why is the Generic Interrupt Controller (GIC) clock frequency in the ARM Cortex-A15 multi-core processor lower than the main core frequency? - arm

I was studying Generic Interrupt Controllers (GIC) in ARM, and when I read about the clock frequency for the GIC, the documentation stated that a GIC runs at a clock frequency that is the main core clock frequency divided by an integer. Can somebody explain why that is?

High-speed circuits are a LOT harder to design, manufacture and control. So, if speed is not that important (as is the case with your GIC), the peripherals of the core will usually run a lot slower than the core itself. Even the L2 cache usually does not run at full core speed.
Also, the gain from a faster-clocked GIC is probably negligible, so there is no reason for the designers to develop a new generation, which in this business is always an expensive and risky adventure.

How to determine MCU Clock speed requirements

Overview:
I spent a while trying to think of how to formulate this question. To narrow the scope, I wanted to provide my initial HW requirements in the form of a ‘real life’ example application.
I understand that clock speed is probably relative, in the sense that it has to be evaluated on a case-by-case basis. For example, your requirement for a certain speed may be affected by the on-chip peripherals offered by the MCU. As an example, you might spend (n) cycles servicing an ISR for an encoder, or you could pick an MCU that has a QEI input to do it for you (to some degree), which in turn may loosen your speed requirement.
I am not an expert, and am very much still learning, so please call me out if I use an incorrect term or completely misinterpret something. I assure you, the feedback is welcome!
Example Application:
This application is relatively simple. It can be thought of as a non-blocking state machine, where each ‘iteration’ of the machine must complete within 20ms. A single iteration of this machine has 4 main tasks:
Decode a serial payload consisting of 32 bytes. The length is fixed at 32 bytes, the payload is dynamic, and the baud rate is 115200 bps (See Task #2 below)
Read 4 incremental shaft encoder signals, which are coupled to 4 DC motors, 1 encoder for each motor (See Task #1 below)
Determine the position of 4 limit switches. ISR driven, trigger on rising edge for each switch.
Based on the 3 categories of inputs above, the MCU will output 4 separate PWM signals @ 50 Hz (20 ms period) to a motor controller for its next set of movements. (See Task #3 below)
From an IO perspective, I know that the MCU is on the hook for reading 12 digital inputs (4 quadrature encoders at 2 channels each, plus 4 limit switches), and decoding a serial frame of 32 bytes over UART.
Based on that data, the MCU will output 4 independent PWM signals, with a pulse width of [1000 µs - 3200 µs] per motor, to the motor controller.
The Question:
After all is said and done, I am trying to think through how I can map my requirements into MCU selection, solely from a speed point of view.
It’s easy for me to look through the datasheet and say this chip meets my requirements because it has (n) UARTs, (n) ISR input pins, (n) PWM outputs etc. But my projects are so small that I always assume the processor is ‘fast enough’. Aside from my immediate peripheral needs, I never really look into the actual MCU speed, which is an issue on my end.
To resolve that, I am trying to understand what goes into selecting a particular clock speed based on the needs of a given application. Or, another way to say it, which is probably wrong: how do you quantify the theoretical load on the processor for a specific application?
Additional Information
Task #1: Encoder:
Each of the 4 motors has a different task within the system, but regardless, they are the same brand/model motor and have a maximum speed of 230 RPM. My assumption is that if, at worst case, one of the motors is spinning at 230 RPM, then at full quadrature resolution (counting rising/falling edges on channels A and B) the 1000 PPR encoder would generate 4K interrupts per revolution. As such, the MCU would have to service those interrupts, potentially creating a bottleneck for the system. For example, if (n) clock cycles are required to service the ISR, and for 1 revolution of 1 motor we expect 4K interrupts, that would be … 230 (RPM) * 4K (ISRs per rev) == 920,000 interrupts per minute? Yikes! And then I guess you could extrapolate and say, again at worst case, with each of the 4 motors spinning at 230 RPM and the encoders at full resolution, the system would have to endure 920K interrupts per minute for each encoder. So 920K * 4 motors == 3,680,000 interrupts per minute? I am 100% sure I am doing something wrong, so please feel free to set me straight.
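For what it's worth, the arithmetic above checks out; converting it to per-second terms makes it easier to reason about. A quick host-side sanity check in plain C, using only the numbers from the question:

    #include <stdio.h>

    int main(void)
    {
        const double rpm    = 230.0;        /* worst-case motor speed      */
        const double edges  = 1000.0 * 4.0; /* 1000 PPR at full quadrature */
        const double motors = 4.0;

        double per_encoder = rpm / 60.0 * edges;   /* ~15,333 interrupts/s */
        double total       = per_encoder * motors; /* ~61,333 interrupts/s */

        printf("per encoder: %.0f interrupts/s\n", per_encoder);
        printf("all motors:  %.0f interrupts/s (one every %.1f us)\n",
               total, 1e6 / total);
        return 0;
    }

So 3,680,000 interrupts per minute is about 61,000 per second, i.e. one interrupt roughly every 16 µs in the aggregate worst case.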
Task #2: Serial Decoding
The MCU will require a dedicated HW serial port to decode a packet of 32 bytes, which repeats, with different values, every 7ms. Baud rate will be set to 115200bps.
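(For scale: at 115200 baud with 8N1 framing, i.e. 10 bits on the wire per byte, the 32-byte packet takes 32 × 10 / 115200 ≈ 2.8 ms to arrive, comfortably inside the 7 ms repetition window, with one byte arriving roughly every 87 µs.)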
Task #3: PWM Output
Based on the information from tasks 1 and 2, the MCU will write to 4 separate PWM outputs. The pulse for each output will be between 1000-3200usec with a frequency of 50Hz.
You need to separate the real-time critical parts from the rest of the application. For example, the actual reception of a UART frame is somewhat time-critical if you do it interrupt-driven. But the protocol decoding is not critical at all, unless you are expected to respond within a certain time.
Decode a serial payload, consisting of 32 bytes.
You can either do this the old school way with interrupts filling up a buffer, or you could look for a part with DMA, which is fairly common nowadays. DMA means that you won't have to consider some annoying, relatively low frequency UART interrupt disrupting other tasks.
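A minimal sketch of the interrupt-driven variant (uart_read_byte() and the ISR registration are assumptions; every vendor HAL names these differently):

    #include <stdint.h>

    #define RX_BUF_SIZE 64u                 /* power of two, > one 32-byte packet */

    static volatile uint8_t rx_buf[RX_BUF_SIZE];
    static volatile uint8_t rx_head;        /* written only by the ISR       */
    static uint8_t          rx_tail;        /* written only by the main loop */

    extern uint8_t uart_read_byte(void);    /* assumed: reads the RX data register */

    void uart_rx_isr(void)                  /* assumed: fires once per received byte */
    {
        rx_buf[rx_head] = uart_read_byte();
        rx_head = (uint8_t)((rx_head + 1u) & (RX_BUF_SIZE - 1u));
    }

    int uart_rx_pop(void)                   /* main loop polls; -1 when empty */
    {
        if (rx_tail == rx_head)
            return -1;
        int byte = rx_buf[rx_tail];
        rx_tail = (uint8_t)((rx_tail + 1u) & (RX_BUF_SIZE - 1u));
        return byte;
    }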
Read 4 incremental shaft encoder signals
I haven't worked with such encoders so I can't tell how time-critical they are. But if you have to catch every single interrupt and your calculations are correct, then 3,680,000 interrupts per minute works out to 60 / 3,680,000 ≈ 16.3 µs, so roughly one interrupt every 16 microseconds in the aggregate worst case. That is a demanding rate to service purely in software: an 8-bitter at 8 MHz would have only about 130 cycles per interrupt, with nothing left over. This is exactly the situation where a hardware quadrature (QEI) peripheral, or at least timer/counter inputs counting the edges in hardware, pays for itself.
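If you do decode in software, keep the per-edge work tiny; the classic approach is one table lookup per edge. A sketch (read_encoder_pins() is an assumed helper returning channels A/B as bits 1:0):

    #include <stdint.h>

    extern uint8_t read_encoder_pins(void);

    static volatile int32_t position;   /* running signed count for this encoder */

    void encoder_edge_isr(void)
    {
        /* Index = (previous state << 2) | current state; entries are +1 for
           one direction, -1 for the other, 0 for invalid/no-change. */
        static const int8_t step[16] = {
             0, -1, +1,  0,
            +1,  0,  0, -1,
            -1,  0,  0, +1,
             0, +1, -1,  0
        };
        static uint8_t prev;
        uint8_t curr = read_encoder_pins() & 0x3u;

        position += step[(prev << 2) | curr];
        prev = curr;
    }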
Determine the position of 4 limit switches
You don't mention timing here but I assume this is something that could be polled cyclically by a low priority cyclic timer.
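A sketch of such cyclic polling with a crude two-sample debounce (read_limit_switches() is an assumed board-specific helper returning one bit per switch; call this from a ~1 ms timer tick):

    #include <stdint.h>

    extern uint8_t read_limit_switches(void);

    static volatile uint8_t debounced_state;   /* one bit per switch */

    void poll_limit_switches_1ms(void)
    {
        static uint8_t last_sample;
        uint8_t sample = read_limit_switches();
        uint8_t stable = (uint8_t)~(sample ^ last_sample); /* bits that agree twice */

        /* Where stable, take the new sample; where bouncing, keep the old state. */
        debounced_state = (uint8_t)((debounced_state & ~stable) | (sample & stable));
        last_sample = sample;
    }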
the MCU will output 4 separate PWM signals
Not a problem, just pick a part with a decent hardware PWM peripheral. You should only need to update the PWM duty-cycle registers now and then.
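A minimal sketch of that, assuming a 1 MHz timer tick so one count equals one microsecond (the pwm_set_* shims are hypothetical stand-ins for whatever your vendor's PWM registers look like):

    #define PWM_TICK_HZ   1000000u    /* 1 MHz timer tick -> 1 us resolution */
    #define PWM_PERIOD_US 20000u      /* 20 ms frame = 50 Hz                 */

    extern void pwm_set_prescaler(unsigned long div);    /* hypothetical HAL shims */
    extern void pwm_set_period(unsigned long ticks);
    extern void pwm_set_compare(unsigned ch, unsigned long ticks);

    void pwm_init(unsigned long timer_clk_hz)
    {
        pwm_set_prescaler(timer_clk_hz / PWM_TICK_HZ);   /* e.g. 48 MHz / 1 MHz = 48 */
        pwm_set_period(PWM_PERIOD_US);                   /* counter rolls over every 20 ms */
    }

    /* Called once per 20 ms iteration; pulse_us must stay within [1000, 3200]. */
    void pwm_set_pulse(unsigned ch, unsigned long pulse_us)
    {
        pwm_set_compare(ch, pulse_us);                   /* compare value in 1 us ticks */
    }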
Overall, this doesn't sound all that real-time critical. I've done much worse real-time projects with icky 8- and 16-bitters. However, each time I did, I regretted not picking a faster MCU, because you always come up with stuff to add as the project/product goes on.
It sounds like your average mainstream Cortex-M0+ would be a good candidate for this project. Clock it at ~48 MHz and you'll have plenty of CPU power. Pick a Cortex-M4 or larger if you actually expect floating-point math (I don't quite see why you'd need that, though).
Given the current component crisis, be careful about which brand you pick, though! In particular, steer clear of STM32, since ST can't produce them right now and you might end up waiting over a year for parts.
The answer to the question is "experience". But intuitively, your example is not particularly taxing - although there are plenty of ways you could mess it up. I once worked on a project that ran on a 200 MHz C5502 DSP at near 100% CPU load. The application now runs on a 72 MHz Cortex-M3 at only 60%, with additional functionality and I/O not present in the original implementation.
Your application is I/O bound; depending on data rates (and critically, interrupt rates), I/O seldom constitutes the highest CPU load, and DMA, hardware FIFOs, input-capture timer/counters, hardware PWM etc. can be used to minimise the I/O impact. I shan't go into it in detail; @Lundin has already done that.
Note also that raw processor speed is important for data or signal processing and number crunching - but what I/O generally requires is deterministic real-time response, and that is seldom simply a matter of MHz or MIPS - you will get a more deterministic and possibly faster response from an 8-bit AVR running at a few MHz than you can guarantee from a 500 MHz application processor running Linux - and it won't take 30 seconds to boot!

Timer supply to CPU in QEMU

I am trying to emulate the clock control for an STM32 machine with a Cortex-M4 CPU. The STM32 reference manual states that the clock supplied to the core is HCLK.
The RCC feeds the external clock of the Cortex System Timer (SysTick) with the AHB clock (HCLK) divided by 8. The SysTick can work either with this clock or with the Cortex clock (HCLK), configurable in the SysTick control and status register.
Now, the Cortex-M4 is already emulated by QEMU, and I am reusing it for the STM32 emulation. My confusion: should the HCLK I have modelled for the STM32 supply clock pulses to the Cortex-M4, or does the Cortex-M4 manage its own clock at the 168 MHz HCLK frequency? Or is the clock frequency something different?
If I have to pass this frequency to the Cortex-M4, how do I do that?
QEMU's emulation does not generally try to emulate actual clock lines which send pulses at megahertz rates (this would be incredibly inefficient). Instead when the guest programs a timer device the model of the timer device sets up an internal QEMU timer to fire after the appropriate duration (and the handler for that then raises the interrupt line or does whatever is necessary for emulating the hardware behaviour). The duration is calculated from the values the guest has written to the device registers together with a value for what the clock frequency should be.
QEMU doesn't have any infrastructure for handling things like programmable clock dividers or a "clock tree" that routes clock signals around the SoC (one could be added, but nobody has got around to it yet). Instead timer devices are usually either written with a hard-coded frequency, or may be written to have a QOM property that allows the frequency to be set by the board or SoC model code that creates them.
In particular for the SysTick device in the Cortex-M models the current implementation will program the QEMU timer it uses with durations corresponding to a frequency of:
1MHz, if the guest has set the CLKSOURCE bit to 1 (processor clock)
something which the board model has configured via the 'system_clock_scale' global variable (eg 25MHz for the mps2 boards), if the guest has set CLKSOURCE to 0 (external reference clock)
(The system_clock_scale global should be set to NANOSECONDS_PER_SECOND / clk_frq_in_hz.)
The 1MHz is just a silly hardcoded value that nobody has yet bothered to improve upon, because we haven't run into guest code that cares yet. The system_clock_scale global is clunky but works.
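Concretely, for the 168 MHz HCLK from the question, the board model would set something like this (system_clock_scale and NANOSECONDS_PER_SECOND are real QEMU symbols; which header and init function they live in depends on your QEMU version and board model):

    /* Sketch: somewhere in the STM32 board/SoC model's init function.
       system_clock_scale holds the length of one clock period in ns. */
    #define HCLK_FRQ 168000000  /* 168 MHz, per the question */

    system_clock_scale = NANOSECONDS_PER_SECOND / HCLK_FRQ;  /* integer: ~5 ns/tick */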
None of this affects the speed of the emulated QEMU CPU (ie how many instructions it executes in a given time period). By default QEMU CPUs will run "as fast as possible". You can use the -icount option to specify that you want the CPU to run at a particular rate relative to real time, which sort of implicitly sets the 'cpu frequency', but this will only sort of roughly set an average -- some instructions will run much faster than others, in a not very predictable way. In general QEMU's philosophy is "run guest code as fast as we can", and we don't make any attempt at anything approaching cycle-accurate or otherwise tightly timed emulation.
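For example (the machine and image names here are placeholders; with -icount, shift=N means one guest instruction per 2^N nanoseconds of virtual time, so shift=4 gives a nominal 62.5 MIPS):

    qemu-system-arm -M netduinoplus2 -kernel firmware.elf -icount shift=4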
Update as of 2020: QEMU now has some API and infrastructure for modelling clock trees, which is documented in docs/devel/clocks.rst in the source tree. This is basically a formalized version of the concepts described above, to make it easier for one device to tell another "my clock rate is 20MHz now" without hacks like the "system_clock_scale" global variable or ad-hoc QOM properties.
SysTick is fed via a multiplexer: you can select either the processor clock (HCLK) or HCLK divided by 8 as its clock source.
An old thread and an oft-asked question, so this should help some of you trying to emulate Cortex systems.
If you are booting with a .dtb, then in your .dts you can add a line clock-frequency = <value>; to the 'timers' block and recompile it, as shown below. This does change the timer rate the guest sees (note it does not make the emulated CPU itself execute faster). Clearly, value is some large number - the frequency in Hz.
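For example (a sketch only; the node name, compatible string and address must match what is already in your .dts, and only the clock-frequency line is the addition):

    systick: timer@e000e010 {
        compatible = "arm,armv7m-systick";
        reg = <0xe000e010 0x10>;
        clock-frequency = <168000000>;
    };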

Which MCU (Cortex-M) for a time-critical GPIO application?

We have an application which runs on a PIC24H, and we would like to port it to another MCU, preferably an ARM Cortex. The application is extremely time-critical, meaning that we need extremely deterministic code behaviour. In short: pulses arrive via special hardware on GPIO pins, and the data is analyzed right away. Processing the data is not complex (we don't need a beefy CPU/MCU to do it). After analyzing the data, the GPIO output pins are written with their values.
App in 3 short lines:
process input pins
determine pattern within processing of input pins
based on the received pattern write output pins
The PIC24H runs at 40 MHz and we can toggle a pin in 25 ns; we would be grateful for at least 2x that speed for future upgrades. So an MCU which can run deterministic code and toggle pins at 80 MHz (12.5 ns) would be just fine. We don't need to toggle the pins at a constant fast rate; we need an MCU which can toggle a pin in less than 25 ns. We can't waste cycles while toggling: if one cycle is off, we lose synchronization. Everything must be done with single-cycle precision (or two cycles, but a constant two), so the code must be 100% deterministic.
Please let me know if I'm missing something, or if what we need can be achieved using some other method on a Cortex-M. Just keep in mind that if one cycle is lost (due to cache or similar), we lose signal sync and the app will not do its work correctly, or at all.
Thanks!
Br
According to this blog post, the interrupt latency for Cortex-M ranges from 12 to 16 cycles (assuming you are not using FPU registers) with best-case memories. M0 and M0+ are slower than M3/M4/M7. On top of this, you need to add the GPIO access times (and watch out for different clock frequencies between the core and the peripherals). The Cortex-M7 will support higher clock speeds than the M3/M4.
It still isn't clear how many cycles are consumed in recognising a pattern, and how an interrupt is useful in doing this - generally a low latency interface function like this would be an obvious target for dedicated hardware, but since you have an existing software solution it seems the problem is mis-specified.
Providing you avoid accessing any 'slow' peripherals which might stall the bus, the interrupt latency should be deterministic - any specific device should have documentation which covers this.
NXP have an application note which describes some of the detail of how to measure what is going on.

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds, 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of time in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?
Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
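For example, declaring the counter volatile forces the compiler to keep the loop, though that alone says nothing about how long each iteration takes:

    volatile unsigned int i;   /* volatile: the compiler must perform every access */
    for (i = 0; i < 1000; i++)
    {
        /* still not 1000 CPU cycles: each iteration compiles to several
           instructions plus memory accesses for the volatile counter */
    }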
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.
The chip you refer to has PWM (pulse-width modulation) hardware declared as one of its major winning features. You should rely on this. Please refer to the appropriate application guide. Generally you cannot guarantee 12.5 µs periods from the application layer (and should not try to do so). Even if you managed it directly from the application layer, it's a bad idea: any change in your firmware code can break the timing.
If you use a timer peripheral with PWM output capability, as suggested by @RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However, if you process interrupts at an interval of 12.5 µs, you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer, and poll its reload flag rather than process its interrupt. This gives consistent timing independent of the compiler's code generation or processor speed, and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, the reaction to the timer event will probably be faster than the latency of an interrupt handler.
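As a sketch of that flag-polling pattern (all the timer_* and gpio_* names are hypothetical stand-ins; the C2000's timer/ePWM modules have equivalents):

    /* Assume a free-running hardware timer configured to reload every 12.5 us. */
    timer_start(TIMER0, 1000u);               /* 1000 ticks @ 80 MHz = 12.5 us */
    for (int edge = 0; edge < 10; edge++)     /* 5 pulses = 10 edges           */
    {
        while (!timer_reload_flag(TIMER0))
        {
            /* optionally do other short work here */
        }
        timer_clear_reload_flag(TIMER0);
        gpio_toggle(OUTPUT_PIN);
    }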

ARM926EJ-S cycle-counter

I'm using an ARM926EJ-S and am trying to figure out whether the ARM can provide (e.g. via a readable register) the CPU's cycle counter. I mean a number that represents the count of cycles since the CPU was powered on.
In my system I have only low-resolution external RTCs/timers. I would like to be able to achieve a high-resolution timer.
Many thanks in advance!
You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an overhead in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer-running sections of code, which may be of more practical use than a cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.
The ARM9 is not equipped with a PMU (Performance Monitoring Unit) as included in the Cortex family. The PMU is described here. The Linux kernel comes with support for using the PMU for performance benchmarking. See here for documentation of the perf tool-set.
I'm a bit unsure about the ARM9, need to dig a bit more...
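For contrast, this is what the Cortex-family PMU gives you on an ARMv7-A core (GCC inline assembly; requires privileged mode, and again, the ARM926EJ-S has no such counter):

    #include <stdint.h>

    /* Enable the ARMv7-A PMU cycle counter (PMCCNTR), then read it. */
    static inline void ccnt_enable(void)
    {
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(1u));       /* PMCR.E = 1     */
        __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31)); /* enable PMCCNTR */
    }

    static inline uint32_t ccnt_read(void)
    {
        uint32_t cycles;
        __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
        return cycles;
    }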
