Number of Performance Monitoring Units in ARM Cortex-A53 - arm

How many performance monitoring units (PMU) are in ARM Cortex-A53? Is there one PMU for each core or one PMU for the whole processor?

There is one per core. You can kind of infer this from seeing one entry for each in the ROM table.
In the Cortex-A53 TRM, Fig 2-1 alludes to debug being located per core, and 2.1.9
• ARM v8 debug features in each core.
I don't see anything explicit that there is one PMU instance per core (architectural or not), but it's possible that I missed this since there are a few places where it might be specified.
Section 11.10.1 describes the debug memory map, consisting of a ROM table (index of other components in this group), then CPU debug, CTI, PMU and Trace for each core. To check what is present, you need to read DBGDRAR to find the base of the ROM table, and check bit[0] of the entries listed in Table 11-28. In a 4-core A53, you should find that all 16 devices are present.
One further point, although the PMU is per core and generally counts the events for the core, there can be some processor level events which are visible equally to all PMUs.

Related

How to Access GPU Registers on Linux

I'm attempting to access the GPU registers on an AMD HD7000 series card. I have sensors attached to the GPUs power input and consumption can be measured. I'd like to develop a Linux C++ application to access the performance counters of the GPU to assess power consumption based on GPU events/operations. I am familiar with C and C++, but I am not familiar with operating systems, and I'm not exactly sure where to get started for such a task.
I have a PDF. It is the Bios and Kernel Developers Guide to AMD 15h Family Models 30h-3Fh located here:
https://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
In the linked document, on page 106, the performance counters are described. The counters must be set via a control register before they begin counting. On page 715, the counters are defined in more detail.
This document does not list the memory offset, so I checked the open source AMD Linux 4.14.12 drivers. Under /drivers/gpu/drm/amd/amdgpu/sid.h line 1256, I found a performance counter address of 0x2688:
#define CB_PERFCOUNTER0_SELECT0 0x2688
http://elixir.free-electrons.com/linux/v4.14.12/source/drivers/gpu/drm/amd/amdgpu/sid.h#L1256
Also, a start and stop counter integer beginning line 1457:
#define PERFCOUNTER_START (23 << 0)
#define PERFCOUNTER_STOP (24 << 0)
http://elixir.free-electrons.com/linux/v4.14.12/source/drivers/gpu/drm/amd/amdgpu/sid.h#L1457
I'm assuming the CB_PERFCOUNTER defines are relevant addresses and can be used as an offset to the GPU's register memory location. I have searched on the internet very much, and there are many resources, although it is difficult for me to assess whether each resource has any relevant information or how any step fits into the process as a whole. Any direct explanations or references to material would be greatly appreciated.

PMU for multi threaded environment

I am planning to measure PMU counters for L1,L2,L3 misses branch prediction misses , I have read related Intel documents but i am unsure about the below scenarios.could some one please clarify ?
//assume PMU reset and PERFEVTSELx configurtion done above
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start) //PMU start counters
my_program();
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop) ///PMU stop
//now reading PMU counters
1.what will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
2.what will happen if process scheduled out and schedule back to same core again, meanwhile some other process reset the PMU counters?
How to make sure that we are reading the correct values from PMU counters.?
Machine details:CentOS with Linux kernel 3.10.0-327.22.2.el7.x86_64 , which is powered up with Intel(R) Core(TM) i7-3770 CPU # 3.40GHz
Thanks
Summary of the Intel forum thread started by the OP:
The Linux perf subsystem virtualizes the performance counters, but this means you have to read them with a system call, instead of rdpmc, to get the full virtualized 64-bit value instead of whatever is currently in the architectural performance counter register.
If you want to use rdpmc inside your own code so it can measure itself, pin each thread to a core because context switches don't save/restore PMCs. There's no easy way to avoid measuring everything that happens on the core, including interrupt handlers and other processes that get a timeslice. This can be a good thing, since you need to take the impact of kernel overhead into account.
More useful quotes from John D. McCalpin, PhD ("Dr. Bandwidth"):
For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal. Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html
You can use "pread()" on the /dev/cpu/*/msr device files to read the
MSRs -- this may be a bit easier to read than IOCTL-based code. The
codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent
examples.
There have been a number of approaches to reserving and sharing
performance counters, including both software-only and combined
hardware+software approaches, but at this point there is not a
"standard" approach. (It looks like Intel has a hardware-based
approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what
platforms support it.)
your questions
what will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
You'll see random garbage, same if another process resets PMCs between timeslices of your process.
i got the answers from some Intel forum, the link is below.
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/673602

Intel Performance Monitor -- any way to monitor per-process?

How would I go about monitoring a particular process's execution (namely, its branches, from the Branch Trace Store) using the Intel Performance Counter monitor, while filtering out other process's information?
You should know that BTS (Branch trace store) and Performance monitoring events/counters (inside CPU, its PMU block) are very different things.
The Branch Trace Store is function of CPU when it does record every taken branch (pairs of eip - first of branch instruction and second of branch target; there is also a word of flags added to each pair) in special area of memory. Result of it is very like to Single-stepping and recording order of executed code blocks (basic blocks). It is just like doing code coverage with assistance from compiler, when every branch is instrumented by compiler.
BTS is just a bit in the MSR_DEBUGCTLA MSR (it is intel x86 register); I'm almost sure that this register is thread-specific (as it is in Linux), so you need no to hook scheduler. There is some examples of working with this MSR in windows; but different bit is used. Also, don't forget to set DS_AREA correctly. So, if you really want BTS, take a copy of Intel Arch Manual (Volume 3b, Part "Debugging and Performance monitoring", section "19.7.8 Branch Trace Store (BTS)") and program BTS manually. Hardest part is to handle DS area overflow (you need custom interrupt handler).
If you want to know not a trace of executed code but statistics of you program (how much instructions executed; how well was branches predicted; how much indirect branches are here ...), you should use Performance monitoring events aka "Precise Event Based Sampling" (PEBS). Intel Vtune does this; there should be some other tools, even the Intel PBS your linked. The only problem (this is bit more difficult with free tools) is to find name of Events you want. Events based on instruction execution are always binded to some thread.
What does event-based sampling means: you can set some limit, e.g. 1000 for some event, eg. BR_INST_EXEC.COND ("number of conditional
near branch instructions executed") or BR_INST_EXEC.DIRECT ("all unconditional near branch instructions excluding calls and indirect branches."), up to 2-4 events at once. Then CPU will count every situation which correspond to this event. When there will be 1000th situation, the Event (interrupt) will be generated for instrution EIP. With sampling it is easy to get detailed statistics of your code behaviour. If you will set limit to something very low and if you will not sum events for eip, you will get trace ;)
With PEBS you can know how bad is your code for the CPU, where mispredicted branches are located, which instructions wait data from cache, etc. There are 100s of events (appendix A of Volume 3b).
PS there is some code for BTS/win: http://blog.csdn.net/quincy_hu/article/details/4053163
PPS there is shorter overview of PMU programming, both PEBS and BTS. software.intel.com/file/30320 It is for Nehalem, but it can be actual even for Sandy.
We were forced to build our own instrumenting profiler that reads the MSRs directly to get this information. The Performance Counter Monitor's source code demonstrates how to build a kernel driver that reads them.
Previously we used VTune, but it crashes when run on our app. (When we tried OProfile on the Linux version, it actually crashed the entire kernel and forced us to power-cycle the machine, which was pretty funny.)
Check out https://github.com/andikleen/pmu-tools/blob/master/toplev.py
Examples:
toplev.py -l2 program
measure whole system in level 2 while program is running

Hardware prefetching in corei3

Does corei3 support hardware prefetching through hardware prefetcher? If yes, how do I enable/disable it?
Intel Core i3 processors definitely support hardware prefetching, though Intel's documentation tends to be very weak on details. The brand name "Core i3" refers to both "Nehalem" based and "Sandy Bridge" based processors, so you have to check the specific model number to know which one you are dealing with.
To make things more complicated, newer Intel processors (Nehalem/Westmere/Sandy Bridge) have several different hardware prefetchers -- at least three are mentioned in the Intel Architecture Software Developer's Manual, Volume 3B (publication 253669). Table 30-25 "MSR_OFFCORE_RSP_x Request Type Field Definition" mentions "DCU prefetch" and "L2 prefetchers". These are also mentioned in Appendix A-2, Table A-2, which describes the performance counter events for Core i7, i5, and i3 processors. Event 4EH in Table A-2 mentions that there are both "L1 streamer and IP-Based (IPP) HW prefetchers". There are a few more words on this topic in the corresponding entry (for event 4EH) in Appendix A.4, Table A-6, which describes the performance counters for Westmere processors.
Appendix B-2, Table B-3 in the same document discusses the MSRs (Model Specific Registers) for the Intel Core Microarchitecture, but it looks like many of these carry over into newer versions. Register 1A0h shows that 4 bits control prefetching behavior:
Bit 9: Hardware Prefetcher Disable
Bit 19: Adjacent Cache Line Prefetch Disable
Bit 37: DCU Prefetcher Disable
Bit 39: IP Prefetcher Disable
Tools to enable and disable prefetchers are discussed in:
How do I programmatically disable hardware prefetching?
Yes, Hardware prefetcher does exist in Core i3/i7 machine, but you CAN'T disable them in i3/i7. Two ways to disable the prefetching (1) by changing msr bit (2) through bios. Intel stopped supporting both ways to disable in i3/i7.
Link from comment: https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors Disclosure of H/W prefetcher control on some Intel processors - Vish Viswanathan (Intel), September 24, 2014
This article discloses the MSR setting that can be used to control the various h/w prefetchers that are available on Intel processors based on the following microarchitectures: Nehalem, Westmere, Sandy Bridge, Ivy Bridge, Haswell, and Broadwell.
The above mentioned processors support 4 types of h/w prefetchers for prefetching data. There are 2 prefetchers associated with L1-data cache (also known as DCU DCU prefetcher, DCU IP prefetcher) and 2 prefetchers associated with L2 cache (L2 hardware prefetcher, L2 adjacent cache line prefetcher).
There is a Model Specific Register (MSR) on every core with address of 0x1A4 that can be used to control these 4 prefetchers. Bits 0-3 in this register can be used to either enable or disable these prefetchers. Other bits of this MSR are reserved.

ARM modes and why are there so many?

I'm currently reading/learning about ARM architecture ...
and I was wondering why there are so many modes
(FIQ, User, System, Supervisor, IRQ, ...).
My question is why do we need so many modes? Wouldn't just User and System be enough?
Thanks in advance.
It's just an architectural decision. The big advantage of the multiple modes is that they have some banked registers. Those extra registers allow you to write much less complicated exception routines.
If you were to pick only two, just USR and SYS are probably as good a choice as any, but what would happen when you took an exception? The normal ARM model is to go to an exception mode, set the banked link register for that exception mode to point to the instruction you want to return to after you resolve the exception, save the processor state in the exception mode's SPSR register, and then jump to the exception vector. USR and SYS share all their registers - using this model, you'd blow away your function return address (in LR) every time you took an interrupt!
The FIQ mode in particular has even more banked registers than the other exception modes. Those extra registers are in keeping with the "F" part of FIQ - it stands for "Fast". Not having to save and restore more processor context in software will speed up your interrupt handler.
Not too much to add to Carl's answer. Not sure what family / architecture of ARM processors you're talking about, so I'll just assume based on your question (FIQ, IRQ, etc.) that you're talking about ARM7/9/11. I won't enumerate every difference between every mode in every ARM architecture variant.
In addition to what Carl said, a few other advantages of having different modes for different circumstances:
for example, in the FIQ, you don't have to branch off right away, you can just keep on executing. With other exceptions you have to branch right away
with different modes, you have natural support for separate stacks. If you're multitasking (e.g., RTOS) and you don't have a separate stack when you're in an interrupt mode, you have to build-in extra space onto each task stack for the worst-case interrupt situation
with different modes, certain registers (e.g. CPSR, MMU regs, etc. - depends on architecture) are off-limits. Same thing with certain instructions. You don't want to let user code modify privileged registers, now do you?

Resources