What is the benefit of disabling dual-issue functionality on an ARM Cortex-M7 processor?

The Cortex-M7 provides the possibility to disable dual-issue.
I understand the benefit of dual-issue functionality, but I don't really see the drawback.
Are there some programs that are more efficient without dual-issue? (Perhaps programs with many branches?)
Is it linked to power consumption?

Answer from an expert on ARM architectures: disabling dual-issue can be used (among other measures) when the processor is overheating. With one issue slot idle, less logic switches each cycle, which reduces dynamic power.
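For the curious, the knob lives in the Auxiliary Control Register (ACTLR, architecturally at 0xE000E008). Below is a minimal sketch; the DISDI field position (bits [20:16]) is taken from the Cortex-M7 TRM for the revisions I've seen, so verify it against the TRM for your exact core revision before relying on it.

    #include <stdint.h>

    /* ACTLR is memory-mapped at 0xE000E008 on ARMv7-M parts. */
    #define ACTLR (*(volatile uint32_t *)0xE000E008u)

    /* DISDI field, ACTLR[20:16] per the Cortex-M7 TRM (assumption: check
     * your revision's TRM). Each bit disables dual-issue for one class of
     * instructions; setting all five forces the core to single-issue. */
    #define ACTLR_DISDI_MASK (0x1Fu << 16)

    static inline void force_single_issue(void)
    {
        ACTLR |= ACTLR_DISDI_MASK;
        __asm volatile ("dsb\n\tisb" ::: "memory"); /* make the change take effect */
    }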

Related

Do efficiency cores support the same instructions as performance cores?

When writing a program that requires high computational performance, it is often necessary to use multiple threads, SIMD vectorization, or other instruction-set extensions. One can query the CPU using CPUID to find out which instruction sets it supports. However, since the programmer has no control over which cores actually execute the different threads, it could be a problem if different cores support different instruction sets.
If one queries the CPU at the start of the program, is it safe to assume all threads will support the same instruction set? If not, then does this break programs that assume they do all support the same instructions or are the CPUs clever enough to realize they shouldn't use those cores?
Does one need to query CPUID on each thread separately?
Is there any way a program can avoid running on E-cores?
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
This question is posed out of curiosity, but the answers may affect how I write programs in the future. I would appreciate any informed comments on these questions but please don't just share your thoughts and opinions on how you think it works if you don't know with high confidence. Thanks.
I have only tried to find information on the internet, but found nothing of sufficiently low level to answer these questions adequately.
Do efficiency cores support the same instructions as performance cores?
Yes (for Intel's Alder Lake, but also for ARM's big.LITTLE).
For Alder Lake, operating systems were "deemed unable" to handle heterogeneous CPUs, so Intel disabled extensions that the performance cores already supported (primarily AVX-512) to match the features present in the efficiency cores.
Sadly, supporting heterogeneous CPUs isn't actually hard in some cases (e.g. hypervisors that don't give all CPUs to a single guest) and is solvable in the general case. Failing to provide a way to re-enable the disabled extensions (for an OS that does support heterogeneous CPUs) prevents an OS from even trying to support them in future, essentially turning a temporary workaround into a permanent problem.
Does one need to query CPUID on each thread separately?
Not for the purpose of determining feature availability. If you have highly optimized code (e.g. code tuned differently for different CPU types) you might still want to, even though it's not a strict need; but then you will also need to pin each thread to a specific CPU or group of CPUs.
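A minimal sketch of the query-once approach (GCC/Clang on x86; AVX2 is just an illustrative feature bit). Because the feature set is uniform across P- and E-cores on Alder Lake, the answer does not depend on which core happens to run this code:

    #include <cpuid.h>
    #include <stdio.h>

    /* Query CPUID leaf 7, sub-leaf 0 once at startup; AVX2 is EBX bit 5. */
    static int have_avx2(void)
    {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
            return 0;               /* leaf 7 not supported at all */
        return (ebx >> 5) & 1;
    }

    int main(void)
    {
        printf("AVX2: %s\n", have_avx2() ? "yes" : "no");
        return 0;
    }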
Is there any way a program can avoid running on E-cores?
Potentially, via CPU affinity. Typically it just makes things worse, though: it's better to run on an E-core than to not run at all because the P-cores are already busy.
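As an illustration, here is a Linux-specific sketch for pinning the calling thread to a caller-chosen CPU set. Which CPU numbers are P-cores is system-specific; the sysfs path in the comment is an assumption about hybrid Intel parts and recent kernels:

    #define _GNU_SOURCE
    #include <sched.h>

    /* Pin the calling thread to the given CPU numbers. On hybrid Intel
     * parts, /sys/devices/cpu_core/cpus lists the P-core CPU numbers
     * (assumption: your kernel exposes that PMU sysfs node). */
    int pin_to_cpus(const int *cpus, int n)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < n; i++)
            CPU_SET(cpus[i], &set);
        return sched_setaffinity(0, sizeof set, &set); /* pid 0 = calling thread */
    }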
If the instruction sets are the same, then where is the 'Efficiency'? Is it with less cache, lower clock speed, or something else?
Lower clock speed, a shorter pipeline, less aggressive speculative execution, smaller caches, ...

Where to start ARM Cortex-A programming

I have experience with Cortex-M controllers (LPC series from NXP) and Keil.
I want to move to Cortex-A because my application needs more speed.
I found on the internet that these processors usually come with Linux.
How can I run my code directly rather than using Linux?
I don't need I/O pins.
Where should I start? Which IDE should I use?
I also found that debugging Cortex-A processors is tough because an OS is involved. Is that true?
And is there any way to achieve higher speeds (around a gigahertz) without going for Cortex-A?
By Cortex-M series, I suppose you mean you have experience with the M0 and M3. Right?
If you plan on using the A-series, you should know that they are designed to run operating systems far more than the M-series is; for example, they have memory management units for virtual memory. That's why you won't find many bare-metal programming guides for these processors.
Also, these devices usually have no on-chip ROM or embedded flash, so you basically boot them from an SD card or eMMC.
You may use Linux (easier for you, but not real-time) or an RTOS (also easier). If that doesn't suit you, you can use U-Boot from the SD card or eMMC and perform a couple of non-trivial, architecture-dependent steps to run your bare-metal software, also loaded from the SD card or eMMC; see the sketch below.
I suggest you buy a BeagleBone and start from there.
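To give a feel for what that bare-metal entry point can look like, here is a hypothetical sketch for a BeagleBone-class AM335x part. It assumes U-Boot has already set up clocks, DRAM, and the console UART, that the binary is built with arm-none-eabi-gcc -nostdlib (with _start linked at the load address), and that it is started with U-Boot's go command; UART0_BASE is the AM335x address, so substitute your SoC's:

    #include <stdint.h>

    #define UART0_BASE 0x44E09000u  /* assumption: AM335x UART0 */
    #define UART_THR (*(volatile uint32_t *)(UART0_BASE + 0x00))
    #define UART_LSR (*(volatile uint32_t *)(UART0_BASE + 0x14))

    void _start(void)
    {
        const char *s = "hello from bare metal\r\n";
        for (; *s; s++) {
            while (!(UART_LSR & 0x20))  /* wait for TX holding reg empty */
                ;
            UART_THR = (uint32_t)*s;
        }
        for (;;)                        /* nothing to return to */
            ;
    }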
You can still use a Cortex-A for a normal bare-metal application, and this way you will have something similar to what you had with an application running on a Cortex-M.
However, it really depends on what you want:
If you want to understand how the Cortex-A works, or you are bringing up a custom platform that is not yet stable, then bare-metal coding is your answer, and with it you will learn a lot about Cortex-A functionality.
If you want to use the Cortex-A from a user's point of view, then you need to compile a Linux kernel for your Cortex-A based board and start developing on top of the running kernel.

Cortex M0 vs M0+ Programming perspective

I am struggling with which cortex to choose.
Currently I have a design guy that will give me an M0 with memory for initial development but I want to use M0+ eventually.
Assuming I give up the optional features of the M0+ (MPU and MTB), can I transfer the M0 code to the M0+ without any changes?
I mean, are the libraries the same? The build commands? The linker?
What differences should I consider? I know they have the same ISA so I figured it shouldn't be a problem.
Thanks.
If you just consider the M0 versus the M0+ and not the system peripherals, all code compiled for a Cortex-M0 should work on a Cortex-M0+ platform. They use the same instruction set and programmer's model.
The main differences are the optional MPU and MTB, plus the fact that the Cortex-M0 has no user-mode support (all code runs privileged, i.e. CONTROL.nPRIV cannot be 1).
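To make that last point concrete, here is a sketch assuming a CMSIS device header (ARMCM0plus.h is CMSIS's reference device header name; yours may differ). On an M0+ built with the privilege extension this drops Thread mode to unprivileged; on an M0 the write has no effect, which is exactly why portable code shouldn't rely on it:

    #include "ARMCM0plus.h"  /* assumption: CMSIS reference device header */

    void drop_to_unprivileged(void)
    {
        uint32_t control = __get_CONTROL();
        __set_CONTROL(control | 1u); /* CONTROL.nPRIV = 1: Thread mode unprivileged */
        __ISB();                     /* flush the pipeline so the change is seen */
    }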

Can you check performance of a program running with Qemu Simulator?

Say I am running an ARM simulation using QEMU: is it possible to find the execution time of a program as it would be on the real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be reported accurately through cycle-accurate simulation?
An investigation into this issue at our company concluded that QEMU (for ARM) is not cycle-accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual implementation chosen: ARM is merely an instruction-set IP, and several different implementations exist. If an operating system is emulated as well, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
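To see why gettimeofday-style measurements don't answer the question, consider this minimal sketch: under QEMU the reported time mostly reflects how fast the host emulates, not how many cycles the target would spend.

    #include <stdio.h>
    #include <time.h>

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        double t0 = now_s();
        volatile unsigned x = 0;
        for (unsigned i = 0; i < 100000000u; i++) /* workload under test */
            x += i;
        /* Host-dependent under QEMU; not a proxy for target cycles. */
        printf("elapsed: %.3f s\n", now_s() - t0);
        return 0;
    }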
gem5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is, and I've created an easy-to-get-started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid architectural behavior must be reproduced, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is, of course, the under-documented performance features of processors, which vendors don't release to prevent intellectual-property leakage.
gem5 appears to implement a generic version of common CPU internals, so it should be more cycle-accurate than functional emulators, but truly cycle-accurate emulation is likely impossible without insider knowledge.
Third-party emulation implementers must therefore reverse-engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are the caches, the pipeline, and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

Intel atom or ARM for heavy Signal processing workload

I would like to know which is the better option (in performance):
An Intel dual-core Atom based board
An ARM Cortex-A9 based board (PandaBoard, etc.)
I would like to run some light version of Linux and do some very CPU-intensive computations like image/video processing (maybe 3D later), and also process audio on them. Of course, all floating-point mathematics.
Definitely #2: the PandaBoard is an OMAP4 platform.
OMAP4 contains not only the ARM Cortex-A9 (which on its own is not likely to compete with a dual-core Atom) but, and this is crucial, a full C674x DSP core, with both floating- and fixed-point mathematics.
The embedded DSP core in OMAP4 is fully capable of handling 1080p H.264 decode, with some resources to spare. I have yet to see an Atom platform capable of that.
(Shameless plug: my company is using OMAP3 and evaluating OMAP4 for some of our niche markets, and we might be interested in assisting in yours as well.)
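Besides the DSP, the Cortex-A9's NEON unit is worth using for this kind of workload, and it is reachable from plain C via intrinsics. A minimal sketch (the function name and the multiple-of-4 length are illustrative assumptions; note that the A9's NEON handles single-precision floats only):

    #include <arm_neon.h>

    /* acc[i] += a[i] * b[i], four floats per iteration.
     * Assumes n is a multiple of 4 and the buffers don't overlap. */
    void mac_f32(float *acc, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);
            float32x4_t vb = vld1q_f32(b + i);
            float32x4_t vc = vld1q_f32(acc + i);
            vst1q_f32(acc + i, vmlaq_f32(vc, va, vb)); /* fused multiply-add */
        }
    }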
