Using PMU to count ticks on ARM - arm

I'm trying to use the PMU (specifically use PMCCNTR to determine ticks per usec) from userspace in ARM. I have an arm64 kernel running an arm32 bit userspace app.
I created an LKM to force the PMUSERENR.EN bit to be on, which works. I then I ran a test program from userspace:
asm volatile ("mrc p15, 0, %[en], c9, c14, 0"
: [en] "=r" (pmuserenr));
printf(" -- %08x\n", pmuserenr);
The first few times I ran this, the bit was correctly marked as turned on. But every third or fourth time I run it, the bit shows up as off. After puzzling over this, I came across this link, where the user is enabling this bit for each cpu. Does each core have it's own instance of a PMU, or is there only one per SoC? If there are multiple per SoC, are the PMUCCNTR registers synced up? (That is, if I read the PMUCCNTR, then do a context switch to another core, and read PMUCCNTR again, can I safely compare the two to see how many ticks have elapsed?).
Even if I modify my LKM to initialize on every online-cpu (and register a notifier to enable it on all new cpu's coming up), I'm still seeing the same behavior, where every nth read from userspace shows the bit not set.
My other issue is that the PMCCNTR is not incrementing. I set the PMCR.E bit but it's still not incrementing. I found one website that says I need to set the PMCR.C bit -- but this has a side effect of clearing the counter, which I don't want to do in fear of creating a race condition (where someone else is using the counter as I clear it). Any help or insights would be very welcome.

Each cpu has its own PMU instance, they are entirely independant (except for a small number of cluster wide events, where each PMU keeps its own count).
See, for example, the first hit today for "debug memory map pmu", remembering that v7 and v8 debug memory map standards are different. The actual answer is too 'obvious' to expect it to be explicitly documented.
0x010000 - 0x010FFF CPU 0 Debug
0x030000 - 0x030FFF CPU 0 PMU
0x110000 - 0x110FFF CPU 1 Debug
0x130000 - 0x130FFF CPU 1 PMU
0x210000 - 0x210FFF CPU 2 Debug
0x230000 - 0x230FFF CPU 2 PMU
0x310000 - 0x310FFF CPU 3 Debug
0x330000 - 0x330FFF CPU 3 PMU

Related

Unexpected timing in Arm assembly

I need a very precise timing, so I wrote some assembly code (for ARM M0+).
However, the timing is not what I expected when measuring on an oscilliscope.
#define LOOP_INSTRS_CNT 4 // subs: 1, cmp: 1, bne: 2 (when branching)
#define FREQ_MHZ (BOARD_BOOTCLOCKRUN_CORE_CLOCK / 1000000)
#define DELAY_US_TO_CYCLES(t_us) ((t_us * FREQ_MHZ + LOOP_INSTRS_CNT / 2) / LOOP_INSTRS_CNT)
static inline __attribute__((always_inline)) void timing_delayCycles(uint32_t loopCnt)
{
// note: not all instructions take one cycle, so in total we have 4 cycles in the loop, except for the last iteration.
__asm volatile(
".syntax unified \t\n" /* we need unified to use subs (not for sub, though) */
"0: \t\n"
"subs %[cyc], #1 \t\n" /* assume cycles > 0 */
"cmp %[cyc], #0 \t\n"
"bne.n 0b\t\n" /* this instruction costs 2 cycles when branching! */
: [cyc]"+r" (loopCnt) /* actually input, but we need a temporary register, so we use a dummy output so we can also write to the input register */
: /* input specified in output */
: /* no clobbers */
);
}
// delay test
#define WAIT_TEST_US 100
gpio_clear(PIN1);
timing_delayCycles(DELAY_US_TO_CYCLES(WAIT_TEST_US));
gpio_set(PIN1);
So pretty basic stuff. However, the delay (measured by setting a GPIO pin low, looping, then setting high again) timing is consistently 50% higher than expected. I tried for low values (1 us giving 1.56 us), up to 500 ms giving 750 ms.
I tried to single step, and the loop really does only the 3 steps: subs (1), cmp (1), branch (2). Paranthesis is number of expected clock cycles.
Can anybody shed light on what is going on here?
After some good suggestions I found the issue can be resolved in two ways:
Run core clock and flash clock at the same frequency (if code is running from flash)
Place the code in the SRAM to avoid flash access wait-states.
Note: If anybody copies the above code, note that you can delete the cmp, since the subs has the s flag set. If doing so, remember to set instruction count to 3 instead of 4. This will give you a better time resolution.
You can't use these processors like you would a PIC, the timing doesn't work like that. I have demonstrated this here many times you can look around, maybe will do it again here, but not right now.
First off these are pipelined so you average performance is one thing, but and once in a loop and things like caching and branch prediction learning and other factors have settled then you can get consistent performance, for that implementation. Ignore any documentation related to clocks per instruction for a pipelined processor now matter how shallow, that is the first problem in understanding why the timing doesn't work as expected.
Alignment plays a role and folks are tired of me beating this drum but I have demonstrated it so many times. You can search for fetch in the cortex-m0 TRM and you should immediately see that this will affect performance based on alignment. If the chip vendor has compiled the core for 16 bit only then that would be predictable or more predictable (ignoring other factors). But if they have compiled in the other features and if prefetching is happening as described, then the placement of the loop in the address space can affect the loop by plus or minus a fetch affecting the total time to complete the loop, which is measurable with or without a scope.
Branch prediction, which didn't show up in the arm docs as arm doing it but the chip vendors are fully free to do this.
Caching. While a cortex-m0+ if this is an STM32 or perhaps other brands as well, there is or may be a cache you can't turn off. Not uncommon for the flash to be half the speed of the processor thus flash wait state settings, but often the zero wait state means zero additional and it takes two clocks to get one fetch done or at least is measurable that execution in flash is half the speed of execution in ram with all other settings the same (system clock speed, etc). ST has a pretty good prefetch/caching solution with some trademarked name and perhaps a patent who knows. And rarely can you turn this off or defeat it so the first time through or the time entering the loop can see a delay and technically a pre-fetcher can slow down the loop (see alignment).
Flash, as mentioned depending on the chip vendor and age of the part it is quite common for the flash to be half speed of the core. And then depending on your clock rates, when you read about the flash settings in the chip doc where it shows what the required wait states were relative to system clock speed that is a key performance indicator both for the flash technology and whether or not you should really be raising the system clock up too high, the flash doesn't get any faster it has a speed limit, sram from my experience can keep up and so far I don't see them with wait states, but flashes used to be two or three settings across the range of clock speeds the part supports, the newer released parts the flashes are tending to cover the whole range for the slower cores like the m0+ but the m7 and such keep getting higher clock rates so you would still expect the vendors to need wait states.
Interrupts/exceptions. Are you running this on an rtos, are there interrupts going on are you increasing and/or guaranteeing that this gets interrupted with a longer delay?
Peripheral timing, the peripherals are not expected to respond to a load or store in a single clock they can take as long as they want and depending on the clocking system and chip vendors IP, in house or purchased, the peripheral might not run at the processor clock rate and be running at a divided rate making things slower. Your code no doubt is calling this function for a delay, and then outside this timing loop you are wiggling a gpio pin to see something on a scope which leads to how you conducted your benchmark and additional problems with that based on factors above and this one.
And other factors I have to remember.
Like high end processors like the x86, full sized ARMs, etc the processor no longer determines performance. The chip and motherboard can/do. You basically cannot feed the pipe constantly there are stalls going on all over the place. Dram is slow thus layers of caching trying to deal with it but caching helps sometimes and hurts others, branch predictors hurt as much as they help. And so on but it is heavily driven by the system outside the processor core as to how well you can feed the core, and then you get into the core's properties with respect to the pipeline and its own fetching strategy. Ideally using the width of the bus rather than the size of the instruction, transaction overhead so multiple widths of the bus is even more ideal that one width, etc.
Causing tight loops like this on any core to have a jerky motion and or be inconsistent in timing when the same machine code is used at different alignments. Now granted for size/power/etc the m0+ has a tiny pipe, but it still should show the affects of this. These are not pics or avrs or msp430s no reason to expect a timing loop to be consistent. At best you can use a timing loop for things like spi and i2c bit banging where you need to be greater than or equal to some time value, but if you need to be accurate or within a range, it is technically possible per implementation if you control many of the factors, but it is often not worth the effort and you have this maintenance issue now or readability or understandability of the code.
So bottom line there is no reason to expect consistent timing. If you happened to get consistent/linear timing, then great. The first thing you want to do is check that when you changed and re-built the code to use a different value for the loop that it didn't affect alignment of this loop.
You show a loop like this
loop:
subs r0,#1
cmp r0,#0
bne loop
on a tangent why the cmp, why not just
loop:
subs r0,#1
bne loop
But second you then claim to be measuring this on a scope, which is good because how you measure things plays into the quality of the benchmark often the benchmark varies because of how it is measured the yardstick is the problem not the thing being measured, or you have problems with both then the measurement is much more inconsistent. Had you used systick or other to measure this depending on how you did that the measurement itself can cause the variation, and even if you used gpio to toggle a pin that can and probably is affecting this as well. All other things held constant simply changing the loop count depending on the immediate and the value used could push you between a thumb and thumb2 instruction changing the alignment of some loop.
What you have shown implies you have this timing loop which can be affected by a number of system issues, then you have wrapped that with some other loop itself being affected, plus possibly a call to a gpio library function which can be affected by these factors as well from a performance perspective. Using inline assembly and the style in which you wrote this function that you posted implies you have exposed yourself and can easily see a wide range of performance differences in running what appears to be the same code, or even actually the code under test being the same machine code.
Unless this is a microchip PIC, not PIC32, or a very very short list of other specific brand and family of chips. Ignore the cycle counts per instruction, assume they are wrong, and don't try for accurate timing unless you control the factors.
Use the hardware, if for example you are trying to use the ws8212/neopixel leds and you have a tight window for timing you are not going to be successful or will have limited success using instruction timing. In that specific case you can sometimes get away with using a spi controller or timers in the part to generate accurately timed (far more than you can ever do with software timers managing the bit banging or otherwise). With a PIC I was able to generate tv infrared signals with the carrier frequency and ons and off using timed loops and nops to make a highly accurate signal. I repeated that for one of these programmable led things for a short number of them on a cortex-m using a long linear list of instructions and relying on execution performance it worked but was extremely limited as it was compile time and quick and dirty. SPI controllers are a pain compared to bit banging but another evening with the SPI controller and could send any length of highly accurately timed signals out.
You need to change your focus to using timers and/or on chip peripherals like uart, spi, i2c in non-normal ways to generate whatever signal this is you are trying to generate. Leave timed loops or even timer based loops wrapped by other loops for the greater than or equal cases and not for the within a range of time cases. If unable to do it with one chip, then look around at others, very often when making a product you have to shop for the components, across vendors, etc. Push comes to shove use a CPLD or a PAL or GAL or something like that to get highly accurate but custom timing. Depending on what you are doing and what your larger system picture looks like the ftdi usb chips with mpsse have a generic state machine that you can program to generate an array of signals, they do i2c, spi, jtag, swd etc with this generic programmable system. But if you don't have a usb host then that won't work.
You didn't specify a chip and I have a lot of different chips/boards handy but only a small fraction of what is out there so if I wanted to do a demo it might not be worth it, if mine has the core compiled one way I might not be able to get it to demonstrate a variation where the same exact core from arm compiled another way on another chip might be easy. I suspect first off a lot of your variation is because you are making calls within a bigger loop, call to delay call to state change the gpio, and you are recompiling that for the experiments. Or worse as shown in your question if you are doing a single pass and not a loop around the calls, then that can maximize the inconsistency.

Cortex-R5 PMU cycle counter not counting time correctly

I have the PMU configured correctly for PMCCNTR to tick on a Cortex-R5 running FreeRTOS. I will omit the configuration code since it's been repeated on many other StackOverflow questions. I believe the configuration is correctly because I tried running
__asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(pmccntr))
periodically, and I was able to see that the pmccntr variable increases monotonically and rolls over after (2^32 - 1).
The CPU is running at 800Mhz, so I expected that if I were to read PMCCNTR in a 1Hz task I would notice that the value increases by 800Mhz. However, the difference in PMCCNTR in between calls to the 1Hz task is more like 72 million. I also tried playing with the 64 clock divider to make sure my observations are sane.
Is my math correct? Or perhaps I am using the wrong number as the CPU frequency? What would be a deterministic way to figure out what frequency the PMCCNTR should be counting at?
Update: The root cause is WFI as #Sean Houlihane pointed out
The PMCCNTR does run at core clock, so long as the counter is not disabled and the core isn't in debug state. If you calculate 72 MHz/1.125 MHz then there is a good chance your core is running at external crystal frequency, not from the internal PLL.
The other likely explanation is that the core is in WFI state with clocks stopped for most of the time - in which case the result you measure will be influenced by the amount of work done by the OS.

PMU for multi threaded environment

I am planning to measure PMU counters for L1,L2,L3 misses branch prediction misses , I have read related Intel documents but i am unsure about the below scenarios.could some one please clarify ?
//assume PMU reset and PERFEVTSELx configurtion done above
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_start) //PMU start counters
my_program();
ioctl(fd, IOCTL_MSR_CMDS, (long long)msr_stop) ///PMU stop
//now reading PMU counters
1.what will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
2.what will happen if process scheduled out and schedule back to same core again, meanwhile some other process reset the PMU counters?
How to make sure that we are reading the correct values from PMU counters.?
Machine details:CentOS with Linux kernel 3.10.0-327.22.2.el7.x86_64 , which is powered up with Intel(R) Core(TM) i7-3770 CPU # 3.40GHz
Thanks
Summary of the Intel forum thread started by the OP:
The Linux perf subsystem virtualizes the performance counters, but this means you have to read them with a system call, instead of rdpmc, to get the full virtualized 64-bit value instead of whatever is currently in the architectural performance counter register.
If you want to use rdpmc inside your own code so it can measure itself, pin each thread to a core because context switches don't save/restore PMCs. There's no easy way to avoid measuring everything that happens on the core, including interrupt handlers and other processes that get a timeslice. This can be a good thing, since you need to take the impact of kernel overhead into account.
More useful quotes from John D. McCalpin, PhD ("Dr. Bandwidth"):
For inline code instrumentation you should be able to use the "perf events" API, but the documentation is minimal. Some resources are available at http://web.eece.maine.edu/~vweaver/projects/perf_events/faq.html
You can use "pread()" on the /dev/cpu/*/msr device files to read the
MSRs -- this may be a bit easier to read than IOCTL-based code. The
codes "rdmsr.c" and "wrmsr.c" from "msr-tools-1.3" provide excellent
examples.
There have been a number of approaches to reserving and sharing
performance counters, including both software-only and combined
hardware+software approaches, but at this point there is not a
"standard" approach. (It looks like Intel has a hardware-based
approach using MSR 0x392 IA32_PERF_GLOBAL_INUSE, but I don't know what
platforms support it.)
your questions
what will happen if my process is scheduled out when my_program() is running, and scheduled to another core?
You'll see random garbage, same if another process resets PMCs between timeslices of your process.
i got the answers from some Intel forum, the link is below.
https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/673602

How to detect cold boot versus warm boot on an ARM processor?

I'm looking for a way to determine whether an ARM processor is booting from a cold boot (i.e. initial power-on) versus a warm boot (i.e. reset assertion without actual power loss). Specifically I'm using an ARM968 core, will be making the determination using C or assembly, and I will use the determination so certain operations only run on the initial power-on and not on subsequent resets. In previous projects I've leveraged external circuitry (e.g. FPGA) to detect the different boot scenarios, but in this case I am limited to the ARM core.
Check the docs for you specific chip ("ARM968" is not specific enough). There should be a register that describes the cause of reset. E.g. here's what LPC23xx has:
Reset Source Identification Register (RSIR - 0xE01FC180)
This register contains one bit for each source of Reset. Writing a 1 to any of these bits
clears the corresponding read-side bit to 0. The interactions among the four sources are
described below.
Bit Symbol Description
0 POR Assertion of the POR signal sets this bit, and clears all of the other bits in
this register. But if another Reset signal (e.g., External Reset) remains
asserted after the POR signal is negated, then its bit is set. This bit is not
affected by any of the other sources of Reset.
1 EXTR Assertion of the RESET signal sets this bit. This bit is cleared by POR,
but is not affected by WDT or BOD reset.
2 WDTR This bit is set when the Watchdog Timer times out and the WDTRESET
bit in the Watchdog Mode Register is 1. It is cleared by any of the other
sources of Reset.
3 BODR This bit is set when the 3.3 V power reaches a level below 2.6 V.
If the VDD(DCDC)(3V3) voltage dips from 3.3 V to 2.5 V and backs up, the
BODR bit will be set to 1.
If the VDD(DCDC)(3V3) voltage dips from 3.3 V to 2.5 V and continues to
decline to the level at which POR is asserted (nominally 1 V), the BODR
bit is cleared.
if the VDD(DCDC)(3V3) voltage rises continuously from below 1 V to a level
above 2.6 V, the BODR will be set to 1.
This bit is not affected by External Reset nor Watchdog Reset.
Note: Only in case when a reset occurs and the POR = 0, the BODR bit
indicates if the VDD(DCDC)(3V3) voltage was below 2.6 V or not.
You can initialize a global variable in RAM to a value that is unlikely during cold boot, and check for that during boot.
For microcontrollers normally the reset logic of the specific chip provides a status register, which indicates the source of the reset. I don't know if that exists for this bigger core, and whether you could use that.
It is likely to be difficult, and maybe you dont really mean just the core itself. The core should have gotten a reset, but the memory outside (but perhaps still within the chip) did not. if the memory is dram based then it may still get wiped on boot. I dont know of a generic one size fits all answer. both you and starblue have it though, you have to find some register somewhere that is not cleared on a reset, set that to something that is "likely" not to happen randomly on a power up. read it then set it. thinks like the fpga or pld that manage the reset logic at the board level (if any) are the best because on a power on reset they are reset as well, and on a warm reset they are the one that caused it and keep their state.
dig through the TRM for your core or through the register spec for the chip, and see if there are any registers whose reset state is undefined, one that you normally dont use and wont hurt the chip if you set it to something, and see what it powers up as, that is where I would start looking.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code starting reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed as the wall clock time of the running program did not change although the time reported using QPC did change (it dropped by 2.7/4.1).
MyComputer->Properties shows:
Intel(R)
Pentium(R)
4 CPU 2.80 GHz; 4.11 GHz;
1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets nor to I have a CPU clock that varies itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and not affected by changes in the CPU clock. The strange thing in my situation is that the FREQUENCY reported by QPF changed to an incorrect value and this changed frequency was also reported in MyComputer -> Properties which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have those chipset. Additionally some dual core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and
QueryPerformanceFrequency() offer a
bit better resolution, but have
different issues. For example in
Windows XP, all AMD Athlon X2 dual
core CPUs return the PC of either of
the cores "randomly" (the PC sometimes
jumps a bit backwards), unless you
specially install AMD dual core driver
package to fix the issue. We haven't
noticed any other dual+ core CPUs
having similar issues (p4 dual, p4 ht,
core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected, it uses a RTC. You should probably stop using the performance counters, unless you can tolerate a few (5-50) millisecond's worth of occasional phase adjustments, and are willing to perform some math in order to perform the said phase adjustment by continuously or periodically re-normalizing your performance counter values based on the reported performance counter frequency and on RTC low-resolution time (you can do this on-demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs.)
You can try to use the Stopwatch class from .NET, it could help with your problem since it abstracts from all this low-lever stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it
does not matter which processor the
thread runs on. However, because of
bugs in the BIOS or the Hardware
Abstraction Layer (HAL), you can get
different timing results on different
processors. To specify processor
affinity for a thread, use the
ProcessThread..::.ProcessorAffinity
method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources