How to obtain PMU events when running ARM bigLITTLE inside gem5 - arm

I'm running an ARM full system simulation in gem5 and the configurations I'm using in the commandline is:
./build/ARM/gem5.perf configs/example/arm/fs_bigLITTLE.py
--kernel=/home/ting-bazinga/gem5/linux-arm-gem5/vmlinux
--caches
--disk /home/ting-bazinga/gem5/fs_imgs/disks/aarch64-ubuntu-trusty-headless.img
--bootscript /home/ting-bazinga/gem5/fs_imgs/test.rcS
From the post Using perf_event with the ARM PMU inside gem5 I presume obtaining PMU events in gem5 is possible. However I didn't found the exact method for how to do that.
perf can be used to obtain PMU information, on my local machine I can just download the linux-tools-common in my terminal to use that tool. But I can't do the same with the simulation. There isn't a perf binary that I can just find online (or maybe anyone can give a hint of how to write this kind of binary?) And I also tried downloading the linux-tools-common package, copying it into the disk image then using the makefile to compile it. But somehow the makefile does not work in the simulated system.
Or can the PMU events be abtained using C code? In the post I mentioned above someone used C code to count the number of mispredicted branches by the branch predictor unit during a specific task. And I can use perf_event_open to obtain number of instruction during an execution. However running the perf_event_open code requires root, but I cannot use sudo in the simulated system.
Can anybody give me some instructions on how to obtain PMU events in gem5? Many thanks.

Related

How to use qemu to do profiling on a algorithm

I have a program run well on Ubuntu now. The program is written purely in C. And it will finally run on a embedded processor. I hope to know its execution speed on different target, like Cortex M3, M4 or A series. As there are pretty much double type arithmatic, the difference should be obvious. Currently, my idea is to use qemu to count the instruction executed for some set of data. As the program is only about data processing, the only required resource should be RAM.
I don't need the very accurate result, as it will only serve as a guide to choose CPU. Is there some easy guide for the task? I have little experience with qemu. I saw there are two ways to invoke qemu: qemu-system-arm and qemu-user. I guess the most accurate simulation result should be got by qemu-system-arm. What's more, Cortex M series should not support Linux due to lack of MMU, right?
There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:
branch mispredictions
cache misses
TLB misses
memory latency
as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.
For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.
The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.
On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).
In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.
I post my solution here, in case someone just want a rough estimation as me.
Eclipse embedded CDT provides a good start point. You can start with a simple LED blink template. It support soft FP arithmatic only now. You can start qemu with the built embedded program, and a picture of the STM32F407 board will appear. The LED on the picture will blink as the program goes.
The key point is I can use the script from Counting machine instructions using gdb to count instruction on the qemu target.
However, it seems eclipse embedded cdt will stuck when some library code is executed. Here is my work around, start qemu mannually(the command is got by command 'ps' when eclipse start qemu):
In the first terminal:
qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c
Then in the second terminal:
arm-none-eabi-gdb blinky_c.elf
and below is the command history I input in the gdb terminal
(gdb) show commands
1 target remote :1235
2 load
3 info register
4 set $sp = 0x20020000
5 info register
6 b main
7 c
Then you can use the gdb to count instruction as in Counting machine instructions using gdb.
One big problem with the method is the speed is really slow, as gdb will use stepi to go through all the code to be counted before get a result. It cost me around 3 hours in my ubuntu VMware machine to get 5.5M instruction executed.
One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c
This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a dc motor simulator) and all reads and writes to it are automatically done over network by the simulator that runs this. The platform definition is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl This is how the structure is mapped.
From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).

Qemu: trace MMU operation

Currently I'm trying to run qemu-system-arm for armv7 architecture, do some initial setup for paging and then enable MMU.
I run qemu with gdb stub and connect to it then with gdb.
I must have screwed something up with translation tables/registers/etc., the thing is the minute I set MMU-enable bit in control register, gdb can't fetch data from memory anymore: after ni command which executes mmu-enable instruction it doesn't fetch next command and I can't access memory.
Is there any way to look what happens inside Qemu's MMU? Where it takes translation tables from, what calculates etc.
Or should I just recompile it with my additional debug output?
No, there's no way to trace this without modifying QEMU's sources yourself
So I did. For ARM architecture, the relevant code is found in target-arm/helper.c - get_phys_addr* functions.
No, there's no way to trace this without modifying QEMU's sources yourself to add debugging output. This is a specific case of a more general tendency, which is that QEMU's design and approach is largely "run correct code quickly", not to provide detailed introspection into the behaviour of possibly buggy guests. Sometimes, as in this case, there's a nice easy location to add your own debug printing; sometimes, as in the fairly common desire to print all the memory accesses made by the guest, there is nowhere in the C code where tracing can be put to catch all accesses.
When I used QEMU for debugging VM issues in an operating system kernel that I had built with a colleague, I ended up connecting GDB to debug QEMU instead (instead of debugging the guest process inside QEMU).
You can place breakpoints on the MMU table walking function and step through it.

Profiling on baremetal embedded systems (ARM)

I am wondering how you profile software on bare metal systems (ARM Cortex a8)? Previously I was using a simulator which had built-in benchmark statistics, and now I want to compare results from real hardware (running on a BeagleBoard-Xm).
I understand that you can use gprof, however I'm kind of lost as that assumes you have to run Linux on the target system?
I build the executable file with Codesourcery's arm-none-eabi cross-compiler and the target system is running FreeRTOS.
Closely evaluate what you mean by "profiling". You are indeed operating very close to bare metal, and it's likely that you will be required to take on some of the work performed by a tool like gprof.
Do you want to time a function call? or an ISR? How about toggling a GPIO line upon entering and exiting the code under inspection. A data logger or oscilloscope can be set to trigger on these events. (In my experience, a data logger is more convenient since mine could be configured to capture a sequence of these events - allowing me to compute average timings.)
Do you want to count the number of executions? The Cortex A8 comes equipped with a number of features (like configurable event counters) that can assist: link. Your ARM chip may be equipped with other peripherals that could be used, as well (depending on the vendor). Regardless, take a look at the above link - the new ARMs have lots of cool features that I don't get to play with as much as I would like! ;-)
I have managed to get profiling working for ARM Cortex M. As the GNU ARM Embedded (launchpad) tools do not come with profiling libraries included, I have added the necessary glue and profiling functionality.
References:
See http://mcuoneclipse.com/2015/08/23/tutorial-using-gnu-profiling-gprof-with-arm-cortex-m/
I hope this helps.

OPENOCD, flash program to ARM Cortex M0 (JTAG)

I'm new on OpenOCD, has anyone attempted to use Olimex OpenOCD to actually flash program hex file (from Kiel say) into ARM CORTEX M0 (generic).
What do I need to setup script file to take each word of the hex file to performs mww (memory write word) within the MCU flash?, can anyone provide an example. I use python.
I open for suggestion.
I use Window PC.
All Cortex M0 that I know of have no JTAG, but only SWD support. SWD is not yet available in OpenOCD - it is still in development.
Another note: The method for writing the flash memory is specific for each vendor/chip.
Sure, what platform in particular? some googling will find the exact sequence. flash unlock, erase, program, etc.
Section 6 of this page for example.
http://pygmy.utoh.org/riscy/cortex/led-lpc17xx.html
I am trying to figure out what board I did it on but those were pretty much the commands I followed and it worked just fine. It may have been the leaflabs maple mini. The steps are the same. To avoid the steps or scripting it, etc. what I ended up doing was writing a few lines of bootloader that said if ram+0 = 0x12345678, and ram+4 = 0x87654321 then branch to ram+8 else infinite loop. then it was trivial to use the jtag to load a program into ram with the two words and an entry point at 0x08 bytes into ram, press reset and run the program. On a cold power up it just hits the infinite loop. I spend my day on a bigger arm based system loading everything into ram using jtag so it made it quite comfortable. You could just script it in openocd and simply type the openocd command have the flash load happen.
Update for people stopping by...
You do not have to use mww, if you're just trying to flash-program (eg. upload your own code) to your microcontroller.
Some time ago, OpenOCD got a ("built-in") convenience-script, that you can use for programming, this "command" is called "program".
Here's an example from the documentation on the "program" command:
openocd -f interface/ftdi/jtag-lock-pick_tiny_2.cfg -f board/stm32f3discovery.cfg -c "program filename.elf verify reset"
-Replace "stm32f3discovery" by your board. If you use a different adapter, replace the interface with the appropriate configuration file.

How to monitor machine code calls by binary program

My goal is to record the number of processor instructions executed by a given binary program through the duration of its run. While it's easy to get the actual machine code from the source code (through gdb or any other disassembler), this does not take into account function calls and branches within the program that cause instructions to be executed more than once or skipped altogether.
Is there a straightforward solution to this?
This is very hardware specific, but most processors offer a facility that counts the exact number of machine instructions (and other events) that have flowed through them. That's how profilers work to capture things like cache misses: by querying these internal registers.
The PAPI library provides calls to query this data on a variety of major processors. If you're on Linux+x86, PerfSuite gives you some more high-level tools which may be easier to start with.
Intel has a monitor app you can use to watch the chip's internal counters in realtime, and their Performance Analysis Guide describes the various Performance Monitoring Units on the chip and how to read them.
If you're on Linux, you should be able to run your program through cachegrind to get instruction counts.
It may also be possible to use ollydbg's Run Trace function to obtain an instruction count, but that may be limited by memory.
Alternately, it is possible to write a small debugger that simply runs the program in single steps.
The raw tools for tracking system calls are platform specific.
Solaris: truss or dtrace
MacOS X: dtrace
Linux: strace
HP-UX: tusc
AIX: truss
Windows: ...
For example (Solaris):
truss -o ls.truss ls $HOME
This will capture all the system calls made by ls as it lists your home directory.
OTOH, this may not be what you're after...in which case it is of limited value.

Resources