Hardware Performance Counters using C [duplicate] - c

I'd like to use hardware performance counters, specifically on x86 CPUs, to obtain cache-miss or branch-misprediction counts. Performance counters are heavily used in advanced profilers like Intel VTune. Please don't confuse these with the performance counters of the Windows operating system.
In order to use these counters in C/C++ program, one may use PAPI: http://icl.cs.utk.edu/papi/
This lets you use performance counters easily, but only on Linux. PAPI once supported Windows, but no longer does.
Is there anyone who recently tried PAPI or other APIs to use hardware performance counters on Windows?
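For reference, on Linux a typical PAPI measurement looks roughly like this (a minimal sketch; error checking is omitted, and the preset events PAPI_L1_DCM and PAPI_BR_MSP are only usable if the CPU exposes matching counters):

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int event_set = PAPI_NULL;
        long long counts[2];

        /* Initialize the library and build an event set. */
        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            exit(1);
        PAPI_create_eventset(&event_set);

        /* Preset events: L1 data-cache misses and mispredicted branches. */
        PAPI_add_event(event_set, PAPI_L1_DCM);
        PAPI_add_event(event_set, PAPI_BR_MSP);

        PAPI_start(event_set);
        /* ... code under measurement ... */
        PAPI_stop(event_set, counts);

        printf("L1D misses: %lld, branch mispredictions: %lld\n",
               counts[0], counts[1]);
        return 0;
    }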

You can use the RDPMC instruction or the __readpmc MSVC compiler intrinsic, which is the same thing.
However, Windows prohibits user-mode applications from executing this instruction by setting CR4.PCE to 0. Presumably, this is done because the meaning of each counter is determined by MSRs, which are only accessible in kernel mode. In other words, unless you're running in kernel mode (e.g. as a device driver), you are going to get a "privileged instruction" trap if you attempt to execute this instruction.
If you're writing a user-mode application, your only option is (as @Christopher mentioned in the comments) to write a kernel module which executes this instruction for you (you'll incur a user->kernel call penalty) and to enable test signing on your machine so your presumably self-signed "driver" can be loaded. This means you can't easily distribute the app, but it will work for in-house tuning.
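A very rough, hypothetical sketch of what the kernel-mode side might look like using the MSVC kernel intrinsics (the helper names are made up; programming which event each counter measures goes through MSRs, e.g. with __writemsr, and is omitted here):

    #include <intrin.h>

    /* Kernel mode only: allow RDPMC at CPL > 0 by setting CR4.PCE (bit 8). */
    void EnableUserModeRdpmc(void)
    {
        __writecr4(__readcr4() | (1ull << 8));
    }

    /* Read performance counter <counter>; what it counts must have been
       programmed beforehand via the performance-event-select MSRs. */
    unsigned __int64 ReadCounter(unsigned long counter)
    {
        return __readpmc(counter);
    }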

What about this HCP Reference? Does it not provide what you want?

Related

__rdtsc/__rdtscp for ARM Mac M1/M2?

I want to insert some time measurement into my code. On x64 I use __rdtscp. Is there something similar for the Mac M1/M2? Specifically, something that isn't a system call and is high resolution.
Just use clock_gettime(CLOCK_MONOTONIC,...)
It is a VDSO function. That means that the kernel injects code into the userspace program that "does the right thing", so the userspace program can access the time stamp counter without doing a syscall.
On x86, it will [usually] invoke rdtsc [or the HPET], and adjust the counter value to represent nanoseconds.
On arm, the TSC is a control register, accessible only in kernel mode. But, higher end arm arches allow this to be mapped for R/O access by userspace. The kernel enables the mapping. Then, the VDSO snippet will know how to access the values via the mapping.
Calls to clock_gettime are fast. So fast that it's not worth trying to access the counter register directly.
Also, it's not terribly meaningful to access the counter directly, because we still have to convert it to some standard unit (e.g. nanoseconds). The VDSO snippet will do this.
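A minimal sketch of the clock_gettime pattern (plain POSIX C, nothing platform-specific assumed):

    #include <inttypes.h>
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* ... code under measurement ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000
                   + (t1.tv_nsec - t0.tv_nsec);
        printf("elapsed: %" PRId64 " ns\n", ns);
        return 0;
    }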
UPDATE:
Is it a VDSO call on macOS, too? – fuz
My direct experience with arm was on an nVidia Jetson [under linux].
But, AFAIK, macOS provides [has to provide] clock_gettime.
On older kernels, it may have to issue a syscall equivalent.
But, since the architecture provides the means for userspace to do the direct access on a given OS/kernel, there is every reason to believe the VDSO method is available under macOS as well. In fact, it is: https://www.unix.com/man-page/osx/7/vdso/
The way to see the specific mechanism is to build a program that uses clock_gettime and [using gdb] single step it a bit. Then, it is possible to have gdb disassemble the clock_gettime code.
We have to use gdb [vs. objdump and/or readelf] for the disassembly because the snippet is loaded/injected by the kernel dynamically, so it's not easily accessible with static analysis.
Further, the injected code can be processor model specific. The kernel probes the CPU arch and its features during boot. It crafts the snippet based on the features it finds.
Using gdb is how I examined clock_gettime [about 3 years ago for a commercial product], to verify that it would access the H/W without a syscall and that it provided the correct nanosecond values. In that particular case, I also looked at the arch specific sections in the kernel source code.

How to determine if an ARM processor is running in the usual locked-down "world" or in the Secure "world"?

For example, virt-what shows if you are running inside a hardware-virtualization "sandbox".
How to detect if you are running in ARM "TrustZone" sandbox?
TrustZone may be different from what you think. There is a continuum of modes, from 'a simple API of trusted functions' to 'dual OSs' running, one in each world.
More context in the question would be helpful. Is this for programmatic detection or for reverse-engineering considerations? For current Linux user space, the answer is no.
Summary
- No current user-space utility.
- Time-based analysis.
- Code-based analysis.
- CPU exclusion and SCR.
- ID_PFR1 bits [7:4].
virt-what is not a fool-proof way of discovering whether you are running under a hypervisor. It is a program written for Linux user space. Mostly, these are shell scripts which examine /proc/cpuinfo, etc. procfs is a pseudo-file system which runs code in the kernel context and reports to user space. There is no such detection of TrustZone in mainline ARM Linux. By design, ARM has made it difficult to detect. A design intent is to have code in the normal world run unmodified.
Code analysis
In order to talk to the secure world, the normal world needs SMC instructions. If your user space has access to the kernel code or the vmlinux image, you can try to analyze the code sections for an SMC instruction. However, this code may be present in the image but never active. At least this tells you whether the Linux kernel has some support for TrustZone. You could write a kernel module which would trap any execution of an SMC instruction, but there are probably better solutions.
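As a rough, hypothetical sketch of such an analysis, you could scan a raw code section for the A32 encoding of SMC (smc #0 encodes as 0xE1600070; masking off the condition field and the 4-bit immediate gives the pattern below). This ignores Thumb encodings, and data words that happen to match will produce false positives, so treat any hit only as a hint:

    #include <stddef.h>
    #include <stdint.h>

    /* Count A32 SMC instructions in a buffer of 32-bit instruction words. */
    static size_t count_smc(const uint32_t *code, size_t nwords)
    {
        size_t hits = 0;
        for (size_t i = 0; i < nwords; i++)
            if ((code[i] & 0x0FF000F0u) == 0x01600070u)
                hits++;
        return hits;
    }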
Timing analysis
If an OS is running in the secure world, some time analysis would show that some CPU cycles have been stolen if frequency scaling is not active. I think this is not an answer in the spirit of the original question. This relies on knowing that the secure world is a full-blown OS with a timer (or at least pre-emptible interrupts).
CPU exclusion and SCR
The SCR (Secure configuration register) is not available in the normal world. From the ARM Cortex-A5 MPcore manual (pg4-46),
Usage constraints: The SCR is:
• only accessible in privileged modes
• only accessible in Secure state.
An attempt to access the SCR from any state other than secure privileged results in an Undefined instruction exception.
ID_PFR1 bits [7:4].
On some Cortex-A series, the instruction,
mrc p15, 0, r0, c0, c1, 1
will get a value where bits [7:4] indicate whether the CPU supports the Security Extensions, also known as TrustZone. A non-zero value indicates they are supported. Many early CPUs may not support this CP15 register, so it is much like the SCR: you have to handle the undefined instruction. Also, it doesn't tell you whether code is active in the TrustZone mode.
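A sketch of that read from privileged code (PL1, e.g. inside a kernel module) using GCC inline assembly; the function name is made up for illustration:

    /* Returns non-zero if ID_PFR1[7:4] reports the Security Extensions. */
    static int has_security_extensions(void)
    {
        unsigned int id_pfr1;

        /* ID_PFR1: MRC p15, 0, <Rt>, c0, c1, 1 */
        __asm__ volatile("mrc p15, 0, %0, c0, c1, 1" : "=r"(id_pfr1));

        return (id_pfr1 >> 4) & 0xF;
    }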
Summary
It is possible that you could write a kernel module which would try this instruction and handle the undefined exception. This would detect a normal versus secure world. However, you would have to exclude CPUs which don't have TrustZone at all.
If the device is not an ARMv6 or better, then TrustZone is impossible. A great many Cortex-A devices have TrustZone in the CPU, but it is not active.
The combined SMC test and CPU id check is still not sufficient. Some boot loaders run in the secure world and then transition to the normal world, so secure is only active during boot.
Theoretically, it is possible to know, especially with more knowledge of the system. There may be many signs, such as spurious interrupts from the GIC, etc. However, I don't believe that any user-space Linux tool exists as of Jan 2014. This is a typical war of escalation between virus/rootkit writers and malware detection software. See also: TZ Rootkits.
You have not specified any details of the processor (A8, A9, A15?) or the execution mode (user/kernel/monitor) from which you want to detect the processor state.
As per the ARM documentation, the current state of the processor as Secure (aka the TrustZone sandbox) or Non-secure can be detected by reading the Secure Configuration Register and checking for the NS bit.
To access the Secure Configuration Register: MRC p15, 0, <Rd>, c1, c1, 0
Bit 0 being set corresponds to the processor being in non-secure mode and vice-versa.
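As a sketch (and, per the constraints quoted above, only usable from secure privileged code; anywhere else it traps with an Undefined instruction exception):

    /* Read SCR and return the NS bit: 1 = Non-secure, 0 = Secure. */
    static int scr_ns_bit(void)
    {
        unsigned int scr;

        /* SCR: MRC p15, 0, <Rt>, c1, c1, 0 */
        __asm__ volatile("mrc p15, 0, %0, c1, c1, 0" : "=r"(scr));

        return scr & 1;
    }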
You can check the processor's datasheet and find registers that behave differently between the Normal world and the Secure world. Generally, when you read such a register in the Secure world you just get null, but you get data in the Normal world. Also, some registers can only be accessed in the Secure world: if you are in the Secure world you can access them, but in the Normal world the access will be rejected.
Anyway, there are many ways to distinguish the Normal world from the Secure world; just read the datasheet in detail.

Can you check performance of a program running with Qemu Simulator?

Say I am running an ARM simulator using QEMU: is it possible to find the execution time of a program as it would be on the real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be given accurately, i.e. is the simulation cycle-accurate?
Investigation into this issue at our company concluded that QEMU (for ARM) is not cycle-accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing is dependent on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen. Note that ARM is merely an instruction set IP and several different implementations exist. If in addition an operating system is emulated, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is, and I've created an easy-to-get-started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid x86 ISA behavior must be reached, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is of course the under documented performance features of processors, which vendors don't release to prevent intellectual property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle accurate emulation is likely impossible without insider knowledge.
Third party emulation implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

To what extent are interrupts supported in Win32?

To what extent are interrupts supported in Win32 beyond processor definitions? For example, x86 machines define at least 18 interrupts, including traps such as the breakpoint trap (INT 3). The remaining interrupt vectors (up to 255) are left open by Intel as software-defined interrupts. Are any of these used by Windows/WinAPI, or are they just open and free for applications to use as they please? If Windows uses them, where can I find the relevant documentation? I looked on MSDN and could not find anything.
(BTW I am doing compiler, debugger and other system-level programming, so please don't lecture me on your opinions about the advisability of using interrupts in the first place.)
In Win32 apps, there's probably just one interrupt used commonly, int 2Eh. It's used as the system call entry point. It's analogous to int 21h in DOS. The rest of the interrupts aren't used by apps.
Apps, however, can handle some CPU exceptions (and debug breaks) via Structured Exception Handling (SEH)/Vectored Exception Handling (VEH). Windows catches CPU exceptions originating in apps and reflects them back into the apps, if and however possible (Windows is not perfect in imitating the CPU exception model).
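For example, a user-mode program can catch its own breakpoint trap (INT 3) with SEH; a minimal sketch:

    #include <windows.h>
    #include <intrin.h>
    #include <stdio.h>

    int main(void)
    {
        __try {
            __debugbreak();   /* raises EXCEPTION_BREAKPOINT (INT 3) */
        } __except (GetExceptionCode() == EXCEPTION_BREAKPOINT
                        ? EXCEPTION_EXECUTE_HANDLER
                        : EXCEPTION_CONTINUE_SEARCH) {
            printf("caught the breakpoint via SEH\n");
        }
        return 0;
    }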
Windows uses device interrupts internally and does not let apps mess with them. The x86 CPU handles interrupts in the most privileged mode, where the kernel runs.
Nowadays, many device interrupts aren't associated with fixed interrupt vectors; they are configurable, and you need to work with things like PCI configuration to query or change the settings.
If you want to work with devices and interrupts directly, you need to write a kernel-mode driver for Windows. There's the Device Driver Kit (DDK) and books like Windows Internals that can get you started.
Still, if you're looking for specifics of device XYZ and its interrupt programming, you aren't going to find everything or much on MSDN or in the DDK because you'll need hardware-specific information, something that's outside of Microsoft's control. The kernel provides the functionality necessary to do I/O and handle interrupts, but it's ultimately up to device drivers to use them one way or the other.

Any open-source ARM7 emulators suitable for linking with C?

I have an open-source Atari 2600 emulator (Z26), and I'd like to add support for cartridges containing an embedded ARM processor (NXP 21xx family). The idea would be to simulate the 6507 until it tries to read or write a byte of memory (which it will do every 841ns). If the 6507 performs a write, put the address and data on some of the ARM's I/O ports and let the ARM code run 20 cycles, confirm that the ARM is floating its data bus, and let the ARM run for another 38 cycles. If the 6507 performs a read, put the address on the ARM's I/O ports, let the ARM run 38 cycles, grab the data from the ARM's I/O port (hopefully the ARM software will have put it there), and let the ARM run another 20 cycles.
The ARM7 seems pretty straightforward to implement; I don't need to simulate a whole lot of hardware features. Any thoughts?
Edit
What I have in mind would be a routine that would take as a parameter a struct holding the machine state and pointers to a memory access routine. When called, the routine would emulate the ARM's instruction engine, generating appropriate reads, writes, and code fetches. I could then write the memory access routine to regard appropriate areas as flash (with roughly-approximated wait states), RAM, I/O ports, and timer registers. Some other areas would be marked as don't-care, and accesses to any other areas would flag an error and stop the emulator.
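A hypothetical sketch of that interface (the names are invented here, just to show the shape of it):

    #include <stdint.h>

    typedef struct {
        uint32_t r[16];    /* ARM registers; r[15] is the PC */
        uint32_t cpsr;     /* status flags, mode bits */
        void *user;        /* opaque pointer passed back to the callbacks */
    } arm_state;

    /* Callbacks supplied by the host emulator (Z26 in this case). */
    typedef uint32_t (*arm_read_fn)(void *user, uint32_t addr, int size);
    typedef void     (*arm_write_fn)(void *user, uint32_t addr,
                                     uint32_t data, int size);

    /* Run the core for roughly `cycles` cycles, generating reads, writes,
       and code fetches through the callbacks; returns cycles consumed. */
    int arm_run(arm_state *st, arm_read_fn rd, arm_write_fn wr, int cycles);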
Perhaps QEMU uses such a thing internally. Since the ARM emulation would be integrated into an already-existing emulation engine (which I didn't write and don't fully understand--the only parts of Z26 I've patched have been the memory read/write logic) I would need something with a fairly small footprint.
Any idea how QEMU works inside? Any idea what the GPL licence would require if I just use 2% of the code in QEMU--whether I'd have to bundle the code for the whole thing, or just the part that I use, or what?
Try QEMU.
With some work, you can make my emulator do what you want. It was written for ARM920, and the Thumb instruction set isn't done yet. Neither is the MMU/cache interface. Also, it's slow because it is an interpreter. On the bright side, it's all written in C99.
http://code.google.com/p/gp2xemu/
I haven't worked on it for a while (The svn trunk is 2 years old), but if you're going to use the code, I'll be glad to help you out with the missing features. It is licensed under MIT, so it's just the same as the broad BSD license.
