Intel CPUs from Sandy Bridge onward have MSRs that give an accurate energy-consumption reading (measured in microjoules). These are visible to the kernel (RAPL, Running Average Power Limit). Is there an equivalent option for ARM CPUs?
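For context, on Linux these counters can also be read without touching the MSRs directly, through the powercap sysfs interface; a minimal sketch, assuming the intel_rapl driver is loaded and exposes the package domain as intel-rapl:0:

```cpp
// Minimal sketch: sample the RAPL package energy counter via Linux powercap.
// Assumes /sys/class/powercap/intel-rapl:0 exists (intel_rapl driver loaded);
// the counter is in microjoules and wraps at max_energy_range_uj.
#include <chrono>
#include <fstream>
#include <iostream>
#include <thread>

static long long read_energy_uj() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj = -1;
    f >> uj;
    return uj;
}

int main() {
    long long before = read_energy_uj();
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long after = read_energy_uj();
    if (before < 0 || after < 0) {
        std::cerr << "RAPL powercap interface not available\n";
        return 1;
    }
    std::cout << "Package energy over 1 s: " << (after - before) << " uJ\n";
}
```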
On servers or desktops it's reasonable to just disable frequency scaling and then run microbenchmarks.
But how do you meaningfully run microbenchmarks on ARM, given the mix of performance and efficiency cores?
Specific context:
Processor: M2
Microbenchmarks for basic algorithms (think sort, strlen, along those lines).
I am interested in measuring both the energy-efficient and the high-performance cores.
UPD: To avoid confusion: I'm pretty sure the library (GoogleBenchmark) can measure the time correctly. I just want to run the binary correctly.
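One way I know of to steer the run on Apple Silicon: macOS does not expose hard core affinity, but QoS classes bias scheduling toward the efficiency or performance cluster. A rough sketch, assuming QOS_CLASS_BACKGROUND typically lands on E-cores and QOS_CLASS_USER_INTERACTIVE on P-cores (scheduler behaviour, not a guarantee), with BENCH_CLUSTER as a hypothetical environment knob:

```cpp
// Sketch: bias a Google Benchmark run toward E- or P-cores on Apple Silicon.
// Assumption: QOS_CLASS_BACKGROUND is usually scheduled on efficiency cores,
// QOS_CLASS_USER_INTERACTIVE on performance cores; this is scheduler policy,
// not a hard affinity guarantee. BENCH_CLUSTER is a hypothetical env var.
#include <benchmark/benchmark.h>
#include <pthread/qos.h>
#include <cstdlib>
#include <cstring>

static void BM_strlen(benchmark::State& state) {
    const char* s = "a moderately long string used for the benchmark";
    for (auto _ : state)
        benchmark::DoNotOptimize(strlen(s));
}
BENCHMARK(BM_strlen);

int main(int argc, char** argv) {
    const char* cluster = std::getenv("BENCH_CLUSTER");
    if (cluster && std::strcmp(cluster, "efficiency") == 0)
        pthread_set_qos_class_self_np(QOS_CLASS_BACKGROUND, 0);
    else
        pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);

    benchmark::Initialize(&argc, argv);
    benchmark::RunSpecifiedBenchmarks();
    return 0;
}
```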
Are there C or C++ libraries that can read the PMC counters of AMD CPUs (Ryzen 7), such as the ones measured by uProf? Things such as cache misses and CPU clock cycles.
I've only come across the AMD Ryzen Master Monitoring SDK here: https://developer.amd.com/amd-ryzentm-master-monitoring-sdk/
However, it appears to only support Ryzen 3000-series CPUs.
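On Linux, the generic route is the perf_event_open syscall (or a wrapper library such as PAPI), which also counts these events on Zen parts; a rough sketch counting cycles and cache misses, assuming perf_event_paranoid permits user-space counting:

```cpp
// Sketch: count CPU cycles and cache misses for a code section via the Linux
// perf_event_open syscall. Assumes the kernel's perf_event_paranoid setting
// allows unprivileged counting (otherwise run as root or relax the sysctl).
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

static int open_counter(uint32_t type, uint64_t config) {
    perf_event_attr attr{};
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;
    return static_cast<int>(syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
    int cycles = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    int misses = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CACHE_MISSES);
    if (cycles < 0 || misses < 0) { perror("perf_event_open"); return 1; }

    ioctl(cycles, PERF_EVENT_IOC_RESET, 0);
    ioctl(misses, PERF_EVENT_IOC_RESET, 0);
    ioctl(cycles, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

    volatile uint64_t sum = 0;                 // section being measured
    for (int i = 0; i < 1000000; ++i) sum += i;

    ioctl(cycles, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(misses, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t c = 0, m = 0;
    read(cycles, &c, sizeof(c));
    read(misses, &m, sizeof(m));
    std::printf("cycles: %llu, cache misses: %llu\n",
                (unsigned long long)c, (unsigned long long)m);
    close(cycles);
    close(misses);
}
```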
I'm trying to run a C x86 application on a Raspberry Pi using Exagear.
On my laptop, the CPU consumption of the C application while it is running is about 50-60%. When I run the same application on the Raspberry Pi, CPU consumption is about 300%. I don't know why there is this difference in CPU consumption between my laptop and the Raspberry Pi running Exagear.
My Raspberry Pi has a quad-core Cortex-A53 processor @ 1.2 GHz with a VideoCore IV GPU and 1 GB of LPDDR2 RAM, while my laptop virtual machine has two processors and 4 GB of RAM.
I'm thinking that maybe there's some kind of problem using my C application with Exagear.
I would like to know what else I could check to figure out the cause of this high CPU consumption.
If I don't use multithreaded paradigms when designing my code, will hyper-threading split the load automagically over the logical cores, or would my code have to be specifically written to take advantage of the other cores, as it would have to be for physical cores?
At the suggestion of @us2012 I'm posting this here from my comment...
There is no such magic. Superscalar CPUs, and especially OoO (out-of-order execution) processors, do magic - but that happens inside one core.
Hyper-threading, by contrast, can be thought of as (very simplified) two pipelines in front of one complete core.
AMD Bulldozer CPUs have something similar, but go a step further: the integer core is split in two as well, and the two pipelines plus integer cores share one floating-point unit. The whole thing is called a "module" and runs two threads.
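To make the "no magic" point concrete: work only lands on a second logical core if the program explicitly creates a second thread; a single-threaded loop stays on one logical core no matter how many hyper-threads the chip has. A minimal sketch:

```cpp
// Sketch: hyper-threading does not parallelize this loop by itself; the
// second logical core is only used because we explicitly spawn a thread.
#include <cstdint>
#include <iostream>
#include <thread>

static uint64_t sum_range(uint64_t lo, uint64_t hi) {
    uint64_t s = 0;
    for (uint64_t i = lo; i < hi; ++i) s += i;
    return s;
}

int main() {
    const uint64_t n = 100000000;

    // Single-threaded: the OS keeps this on one logical core.
    uint64_t serial = sum_range(0, n);

    // Explicitly multithreaded: now the OS can schedule the two halves on two
    // logical cores (possibly two hyper-threads of the same physical core).
    uint64_t a = 0, b = 0;
    std::thread t1([&] { a = sum_range(0, n / 2); });
    std::thread t2([&] { b = sum_range(n / 2, n); });
    t1.join();
    t2.join();

    std::cout << serial << " == " << (a + b) << '\n';
}
```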
TL;DR
Superscalar (from the Wiki)
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multiplier.
Out of order execution (from the Wiki)
In computer engineering, out-of-order execution (OoOE or OOE) is a paradigm used in most high-performance microprocessors to make use of instruction cycles that would otherwise be wasted by a certain type of costly delay. In this paradigm, a processor executes instructions in an order governed by the availability of input data, rather than by their original order in a program. In doing so, the processor can avoid being idle while data is retrieved for the next instruction in a program, processing instead the next instructions which are able to run immediately.
Hyperthreading (from... you know where...)
Hyper-threading (officially Hyper-Threading Technology or HT Technology, abbreviated HTT or HT) is Intel's proprietary simultaneous multithreading (SMT) implementation used to improve parallelization of computations (doing multiple tasks at once) performed on PC microprocessors. It first appeared in February 2002 on Xeon server processors and in November 2002 on Pentium 4 desktop CPUs. Later, Intel included this technology in Itanium, Atom, and Core 'i' Series CPUs, among others.
Bulldozer (not from the wiki)
Bulldozer is the first major redesign of AMD's processor architecture since 2003, when the firm launched its K8 processors, and also features two 128-bit FMA-capable FPUs which can be combined into one 256-bit FPU. This design is accompanied by two integer clusters, each with 4 pipelines (the fetch/decode stage is shared). Bulldozer will also introduce shared L2 cache in the new architecture. AMD's marketing service calls this design a "Module". A 16-threads processor design would feature eight of these "modules", but the operating system will recognize each "module" as two logical cores.
I'm working on optimizing a piece of software and want to measure its performance. I am currently simulating an ARM platform with OVP (Open Virtual Platforms), and the statistics I get are simulation time and simulated instructions.
My question is: why is the simulated instruction count different every time I run the software (different, but in close proximity)? Shouldn't it be the same every time? Isn't it the case that the software I write in C is compiled into ARM assembly instructions, and each time the software runs, the simulated instruction count is how many times those ARM instructions execute? Shouldn't that be the same every time?
How should I measure performance? Take 10 samples of simulated instructions and get the average?
From my experience on a real (non-simulated) ARM, if I take cycle counts for a section of code, the number of cycles will vary, because:
There can be context switches in the middle of your executing code.
The initial state of the CPU may be different upon entering the code section. (e.g. the content of the pipeline, branch prediction etc.)
The cache state will be different on entry to the code section.
External factors such as other hardware accessing external memory.
Due to all of these, taking an average (plus some other statistical measures) is really the only practical approach for real hardware and a real OS. In a good simulator some of these factors are potentially eliminated.
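In practice that can be as simple as repeating the measurement and reporting the mean and spread; a minimal sketch, where measure_once() is a hypothetical stand-in for whatever counter your hardware or simulator provides:

```cpp
// Sketch: run the section under test several times and report mean and
// standard deviation. measure_once() is a hypothetical stand-in for however
// you read cycles / simulated instructions on your setup.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

static double measure_once() {
    using clk = std::chrono::steady_clock;
    auto t0 = clk::now();
    volatile unsigned long long sum = 0;       // code section under test
    for (int i = 0; i < 1000000; ++i) sum += i;
    return std::chrono::duration<double, std::micro>(clk::now() - t0).count();
}

int main() {
    const int runs = 10;
    std::vector<double> samples;
    for (int i = 0; i < runs; ++i) samples.push_back(measure_once());

    double mean = 0;
    for (double s : samples) mean += s;
    mean /= runs;

    double var = 0;
    for (double s : samples) var += (s - mean) * (s - mean);
    double stddev = std::sqrt(var / runs);

    std::printf("mean: %.2f us, stddev: %.2f us over %d runs\n",
                mean, stddev, runs);
}
```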
On some real chips (or if supported by the simulator) the ARM Performance Monitoring Unit can be useful.
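For illustration, reading that cycle counter from user space on AArch64 looks roughly like this (a sketch that assumes the kernel has enabled user access to PMCCNTR_EL0, e.g. through a small kernel module; otherwise the mrs instruction traps):

```cpp
// Sketch: read the AArch64 PMU cycle counter (PMCCNTR_EL0) around a section.
// Assumption: user-space access to the counter has been enabled by the kernel
// (PMUSERENR_EL0); without that, the mrs instruction faults.
#include <cstdint>
#include <cstdio>

static inline uint64_t read_pmccntr() {
    uint64_t cycles;
    asm volatile("mrs %0, pmccntr_el0" : "=r"(cycles));
    return cycles;
}

int main() {
    uint64_t start = read_pmccntr();

    volatile uint64_t sum = 0;                 // section being measured
    for (int i = 0; i < 1000000; ++i) sum += i;

    uint64_t end = read_pmccntr();
    std::printf("cycles: %llu\n", (unsigned long long)(end - start));
}
```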
If you're coding for the Cortex A8 this is a cool online cycle counter that can really help you squeeze more performance out of your code.