In ARM architecture, if an ARM to Thumb mode switch occurs, will a pipeline stall occur?
If so, how many cycles are affected?
Is this the same for Thumb to ARM mode switching?
Does this behavior vary across different ARM processors?
Assuming you switch in the sensible way (with BLX/BX LR), any modern core will predict that (assuming the branch predictor isn't turned off, of course). Writing to the PC directly is a little more variable - big cores might predict it but little cores won't - but it is best avoided anyway.
Otherwise an interworking branch is AFAIK no different to a regular branch, so if it isn't predicted the penalty is just a pipeline flush. The only other way to switch instruction sets is via an exception return, which is a synchronising operation for the whole core (i.e. not the place to be worrying about performance).
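As a rough illustration of the "sensible way", here is a minimal sketch (GCC-specific target attributes; the function names are made up) of an ARM-state caller invoking a Thumb-state callee. The compiler emits the BLX/BX LR interworking pair for you, and the branch predictor treats them like any other call and return.

    /* Minimal sketch (GCC target attributes; hypothetical function names).
     * thumb_add() is compiled to Thumb and returns with BX LR;
     * arm_caller() is compiled to ARM and calls it with BLX.
     * Requires a core with interworking support (ARMv5T or later). */
    __attribute__((target("thumb"), noinline))
    int thumb_add(int a, int b)
    {
        return a + b;            /* returns via BX LR, switching back to ARM state */
    }

    __attribute__((target("arm")))
    int arm_caller(void)
    {
        return thumb_add(2, 3);  /* interworking call, predicted like any BL/BLX */
    }

Building the two functions in separate files with -marm and -mthumb achieves the same thing; the point is only that the generated interworking branches are ordinary, predictable branches.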
No, not at all.
The cost is just like any other branch instruction: if the predictor hits, it's free; if not, it costs the usual 13 cycles.
There's no additional hidden hiccups due to the switching.
Therefore, you can use the interworking mode without worrying about potential penalties related to the mode switching.
As far as I understand, Cortex-M0/M3 processors have only one memory space that holds both instructions and data, and access to it goes only through the memory bus interface. Thus, if I understand correctly, every clock cycle the processor must read a new instruction to enter the pipeline, but that means the bus will always be busy reading instructions, so how can data be read simultaneously (for load word/store word instructions, for example)?
Additionally, what is the latency of reading an instruction from memory? If it is not a single cycle, then the processor must constantly stall until the next instruction is fetched, so how is that handled?
Thanks
Yes, this is how it happens: the processor stalls a lot. This goes on with big processors as well as small ones; it is difficult at best to keep a pipelined processor fed (and although some of these pipes are shallow on the Cortex-Ms, they are pipelined nevertheless).
On many of the parts I have used (and I have touched most of the vendors), the flash runs at half the clock speed of the core, so even at zero wait states you can only get an instruction every other clock (on average, with overhead rolled in) if fetching a halfword at a time. If fetching a word at a time, which many of the cores offer, that is ideally two instructions per two clocks, or one per clock; with 32-bit Thumb-2 instructions you of course take the hit. ST definitely has a prefetcher/cacher thing with a fancy marketing name that does a pretty good job; others may offer that as well, or just rely on what ARM offers, which varies.
The different Cortex-Ms have different mixtures of busses. I hate the von Neumann/Harvard references, as there is little practical use for an actual Harvard architecture; thus the "modified" adjective, which means they can do anything, and which tries to attract folks taught in school that Harvard means performance. The busses can have multiple transactions in flight, and there are several busses, as is somewhat obvious when you go in and release clocks for a peripheral: APB1 clock control, AHB2 clock control, etc., for peripherals, flash, and so on. But we can run code from SRAM, so it's not Harvard. Forget the Harvard and von Neumann terms and just focus on the actual implementation.
The bus documentation is as readily available as the core documentation. If you buy the right FPGA board you can request a free eval of a core, which gives you an up-close and personal view of how it really works.
At the end of the day there is some parallelism, but on many chips the flash is half speed, so if you are not fetching two instructions per access or using some other solution, you are barely keeping up and will stall often whenever there are other accesses on the same bus. Likewise, on many of these chips the peripherals can't run as fast as the core, so that alone incurs a stall; and even if a peripheral runs on the same clock, that doesn't mean it turns around a CSR or data access as fast as SRAM, so you incur a stall there too.
There is no reason to assume you will get one-instruction-per-clock performance out of these parts, any more than out of a full-sized ARM or x86 or anything else.
While there are some important details that are not documented and are only seen when you get the core, there is documentation on each core and bus that gives a rough idea of how to tune your code to perform better, or at least tune your expectations of how it will really perform. I know I have demonstrated this here and elsewhere; it is pretty easy, even with an ST part, to see a performance difference between flash and SRAM and to see that it takes more clocks than instructions to complete a benchmark.
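For what it's worth, this is the kind of measurement I mean, sketched here with the CMSIS register names on a Cortex-M3/M4 (the M0 has no DWT cycle counter). run_benchmark() is a placeholder for whatever loop you want to time, built once to run from flash and once from SRAM; the cycle count shows the wait-state/stall overhead on top of the instruction count.

    #include <stdint.h>
    #include "core_cm3.h"              /* CMSIS core header providing DWT/CoreDebug */

    extern void run_benchmark(void);   /* hypothetical routine under test */

    uint32_t count_cycles(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   /* enable the DWT unit */
        DWT->CYCCNT = 0;                                  /* reset cycle counter */
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting      */

        run_benchmark();

        return DWT->CYCCNT;   /* clocks elapsed, flash wait states and stalls included */
    }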
Your question is too broad in a few ways. The Cortex-M0 and M3 are quite different: the M3 came out first and is dripping with features, while the M0 was tuned for size and has less stuff in general, not necessarily meant to compete in this way. Then how long the latency is, etc., is strictly down to the chip company, and the family within that chip company, so that question is extremely broad across all the Cortex-M products out there; there are dozens of different answers to it. ARM makes cores, not chips; the chip vendors make chips, buy IP from various places, and make some of their own, and some small part of that chip might be IP bought from a processor vendor.
What you've described is known as the "von Neumann bottleneck", and in machines with a pure von Neumann architecture with shared data and program memory, accesses are usually interleaved. However, you might like to check out the "modified Harvard architecture", because that's basically what is in use here. The backing store is shared as in a von Neumann machine, but the instruction and data fetch paths are separate as in a Harvard machine and, crucially, they have separate caches. So if an instruction or data fetch results in a cache hit, a memory fetch doesn't take place and there is no bottleneck.
The second part of your question doesn't make a great deal of sense I'm afraid, because it is meaningless to talk about instruction fetch times in terms of instruction cycles. By definition, if an instruction fetch is delayed for some reason, the execution of that instruction (and subsequent instructions) must be delayed. Typically this is done by inserting NOPs into the pipeline until the next instruction is ready (known as "bubbling" the pipeline).
re: part 2: Instruction fetch can be pipelined to hide some / all of the fetch latency. Cortex-M3 has a prefetch unit with a 3-word FIFO.
https://developer.arm.com/documentation/ddi0337/e/introduction/prefetch-unit (This can hold up to six 16-bit Thumb instructions.)
This buffer can also supply instructions while data load/store is happening, in a config where those compete with each other (not Harvard split bus, and without a data or instruction cache).
This prefetch is of course speculative; discarded on branches. (It's simple and small enough not to be worth doing branch prediction to try to fetch from the right place before decode even knows the upcoming instruction stream contains a branch.)
I am trying to determine which hardware counters available on the ARM Cortex A15 processor are the correct ones to use for determining system-wide L2 cache misses.
My application here is a kernel-level voltage-frequency governor (i.e. it could substitute for the ondemand governor). Because I need access to performance counters at the system level and not attached to a particular program runtime, I am not using existing utilities such as PAPI or Linux's perf tool. From my past experiences with both I understand that these are better used to monitor the performance stats for a particular program or instrumented binary.
I have implemented a kernel module that periodically updates several hardware counter values to sysfs endpoints. The resources I have used include:
Performance Counter Sampling on the Exynos 5422
Arm Technical Reference Manual (Specifically Ch. 11 on the Performance Monitoring Unit)
Searches in perf and PAPI code & documentation to see whether L2 misses are a derived counter rather than a native one.
The hardware counter I am currently using to measure L2 misses is event 0x17: "L2 data cache refill". Printing this value consistently gives 0, even when running data-heavy benchmarks. Is there a different event or set of events I should be using to determine L2 cache misses? Perhaps 0x13, "Data memory accesses", or some composite of events?
It is very possible that the root of my question is a misunderstanding of "L2 data cache refills", but I have not been able to find a clarification on this through documentation and stack overflow searches.
EDIT: I have found that L2 refills was reading 0 because the 5th hardware counter was, for some reason, not working as expected; re-assigning L2 refills to a different counter resolved this particular issue.
EDIT 2: That 5th hardware counter was not functioning because I had not enabled that many counters. Silly me.
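For reference, here is a sketch of what per-counter setup looks like using the ARMv7 cp15 PMU registers from privileged code (the counter index, and reusing event 0x17, are purely illustrative, and it assumes the global enable bit in PMCR has already been set). The bug described in the edits corresponds to skipping the PMCNTENSET write below.

    #include <linux/types.h>

    #define L2D_CACHE_REFILL 0x17   /* ARMv7 common event: L2 data cache refill */

    static void setup_counter(u32 idx, u32 event)
    {
        /* Select the counter via PMSELR, then program its event in PMXEVTYPER. */
        asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (idx));
        asm volatile("mcr p15, 0, %0, c9, c13, 1" : : "r" (event));

        /* Enable the counter in PMCNTENSET -- without this it just reads 0. */
        asm volatile("mcr p15, 0, %0, c9, c12, 1" : : "r" (1U << idx));
    }

    static u32 read_counter(u32 idx)
    {
        u32 val;
        asm volatile("mcr p15, 0, %0, c9, c12, 5" : : "r" (idx));   /* PMSELR    */
        asm volatile("mrc p15, 0, %0, c9, c13, 2" : "=r" (val));    /* PMXEVCNTR */
        return val;
    }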
Quote from a different ARM manual implies refills are not just misses:
L1 data cache refill:
"This event counts all allocations into the L1 cache. This includes read linefills, store linefills, and prefetch linefills."
I am running a bare-metal application on one of the cores of an ARM Cortex-A9 processor. My ISR is quite small and I am wondering whether it would be possible to lock my ISR instructions in the L1 cache. Is it possible? Is there anyone who could explain some drawbacks of doing it?
Regards,
N
The Cortex-A9 does not support L1 cache lockdown (neither instructions nor data).
The drawback is that taking large chunks of the cache away (lockdown is usually done on a granularity of entire cache ways) decreases performance for everything else in the system.
Not to mention the fact that if your ISR is indeed small, and it is called frequently, it is somewhat likely to be in the cache anyway.
What is the benefit you were expecting to gain from doing this?
Your situation is a perfect fit for the fast interrupt (FIQ).
You only have to assign the last interrupt number to that particular ISR.
While the other interrupt vectors are just branches, the FIQ vector is the last entry in the table, so its handler code can be placed directly there, saving one memory load plus an interlock. You save about three cycles or so.
Besides, i-cache lockdown isn't as efficient as d-cache lockdown.
CA9 doesn't support L1 cache lockdown anyway (for some good reasons), so don't bother.
Just make sure the ISR is cache-line aligned for maximum efficiency (a line is typically 32 or 64 bytes).
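Something like this (GCC attributes; the handler name is made up) keeps the handler on a cache-line boundary so it occupies as few I-cache lines as possible:

    /* Align the FIQ handler to a 64-byte boundary (use 32 if that is your line size). */
    void __attribute__((interrupt("FIQ"), aligned(64)))
    fiq_handler(void)
    {
        /* ...acknowledge the source and do the minimal work here... */
    }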
Are there any techniques to optimize code in order to reduce power consumption? The architecture is ARM and the language is C.
From the ARM technical reference site:
The features of the ARM11 MPCore processor that improve energy efficiency include:
- accurate branch and sub-routine return prediction, reducing the number of incorrect instruction fetch and decode operations
- use of physically addressed caches, which reduces the number of cache flushes and refills, saving energy in the system
- the use of MicroTLBs reduces the power consumed in translation and protection lookups each cycle
- the caches use sequential access information to reduce the number of accesses to the tag RAMs and to unwanted data RAMs.
In the ARM11 MPCore processor extensive use is also made of gated clocks and gates to disable inputs to unused functional blocks. Only the logic actively in use to perform a calculation consumes any dynamic power.
Based on this information, I'd say that the processor does a lot of work for you to save power. Any power wastage would come from poorly written code that does more processing than necessary, which you wouldn't want anyway. If you're looking to save power, the overall design of your application will have more effect. Network access, screen rendering, and other power-hungry operations will be of more concern for power consumption.
Optimizing code to use less power is, effectively, just optimizing code. Regardless of whether your motives are monetary, social, political or the like, fewer CPU cycles = less energy used. What I'm trying to say is I think you can probably replace "power consumption" with "execution time", as they would, essentially, be directly proportional - and you therefore may have more success when not "scaring" people off with a power-related question. I may, however, stand corrected :)
Yes. Use a profiler and see which routines are using most of the CPU. On ARM you can use a JTAG connector, if available (I used Lauterbach both for debugging and for profiling). The main problem is generally to put your processor into a low-consumption state (deep sleep) when it is idle. If you cannot reduce the CPU percentage used by much (for example from 80% to 50%), it won't make a big difference. Depending on what operating system you are running, the options may vary.
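On a bare-metal ARM target the usual way to get that idle state is a WFI in the idle loop. A minimal sketch (work_pending() and do_work() are placeholders), and the ARM counterpart of the HLT idea mentioned in another answer below:

    #include <stdbool.h>

    extern bool work_pending(void);   /* hypothetical: is there anything to do? */
    extern void do_work(void);        /* hypothetical: handle one unit of work  */

    void idle_loop(void)
    {
        for (;;) {
            while (work_pending())
                do_work();
            __asm__ volatile("wfi");  /* sleep the core until the next interrupt */
        }
    }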
The July 2010 edition of the Communications of the ACM has an article on energy-efficient algorithms which might interest you. I haven't read it yet so cannot impart any of its wisdom.
Try to stay in on-chip memory (cache) for idle loops, keep I/O to a minimum, and keep bit flipping to a minimum on busses. Non-volatile memory like PROMs and flash consumes more power to store zeros than ones (which is why they erase to ones; it is actually a zero, but the transistor(s) invert the bit before you see it - zeros stored as ones, ones stored as zeros - and this is also why they degrade to ones when they fail). I don't know about volatile memories; DRAM uses far fewer transistors per bit than SRAM, but has to be refreshed.
For any of this to matter, though, you need to start with a lower-power system, as the above may not be noticeable otherwise. Don't use anything from Intel, for example.
If you are not running Windows XP+ or a newer version of Linux, you could run a background thread which does nothing but HLT.
This is how programs like CPUIdle reduce power consumption/heat.
If the processor is tuned to use less power when it needs less cycles, then simply making your code run more efficiently is the solution. Else, there's not much you can do unless the operating system exposes some sort of power management functionality.
Keep IO to a minimum.
On some ARM processors it's possible to reduce power consumption by putting the voltage regulator in standby mode.
I have written a program in C. It's a program created as a result of research. I want to compute the exact number of CPU cycles which the program consumes. The exact number of cycles.
Any idea how can I find that?
The valgrind tool cachegrind (valgrind --tool=cachegrind) will give you a detailed output including the number of instructions executed, cache misses and branch prediction misses. These can be accounted down to individual lines of assembler, so in principle (with knowledge of your exact architecture) you could derive precise cycle counts from this output.
Know that it will change from execution to execution, due to cache effects.
The documentation for the cachegrind tool is here.
No, you can't. The concept of a 'CPU cycle' is not well defined. Modern chips can run at multiple clock rates, and different parts of them can be doing different things at different times.
The question of 'how many total pipeline steps' might in some cases be meaningful, but there is not likely to be a way to get it.
Try OProfile. It uses various hardware counters on the CPU to measure the number of instructions executed and how many cycles have passed. You can see an example of its use in the article, Memory part 7: Memory performance tools.
I am not entirely sure that I know exactly what you're trying to do, but what can be done on modern x86 processors is to read the time stamp counter (TSC) before and after the block of code you're interested in. On the assembly level, this is done using the RDTSC instruction, which gives you the value of the TSC in the edx:eax register pair.
Note however that there are certain caveats to this approach, e.g. if your process starts out on CPU0 and ends up on CPU1, the result you get from RDTSC will refer to the specific processor core that executed the instruction and hence may not be comparable. (There's also the lack of instruction serialisation with RDTSC, but in this context here, I don't think that's so much of an issue.)
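A minimal sketch of that approach using the GCC/Clang intrinsic (work_under_test() is a placeholder; pin the thread to one core for the reasons above, and note that on recent CPUs the TSC ticks at a constant reference rate rather than counting actual core cycles):

    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>               /* __rdtsc() */

    extern void work_under_test(void);   /* hypothetical code being measured */

    int main(void)
    {
        uint64_t start = __rdtsc();
        work_under_test();
        uint64_t end = __rdtsc();

        printf("elapsed TSC ticks: %llu\n", (unsigned long long)(end - start));
        return 0;
    }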
Sorry, but no, at least not for most practical purposes -- it's simply not possible with most normal OSes. Just for example, quite a few OSes don't do a full context switch to handle an interrupt, so the time spent servicing an interrupt can and often will appear to be time spent in whatever process was executing when the interrupt occurred.
The "not for practical purposes" would indicate the possibility of running your program under a cycle accurate simulator. These are available, but mostly for CPUs used primarily in real-time embedded systems, NOT for anything like a full-blown PC. Worse, they (generally) aren't for running anything like a full-blown OS, but for code that runs on the "bare metal."
In theory, you might be able to do something with a virtual machine running something like Windows or Linux -- but I don't know of any existing virtual machine that attempts to, and it would be decidedly non-trivial and probably have pretty serious consequences in performance as well (to put it mildly).