How to monitor machine code calls by binary program - c

My goal is to record the number of processor instructions executed by a given binary program through the duration of its run. While it's easy to get the actual machine code from the source code (through gdb or any other disassembler), this does not take into account function calls and branches within the program that cause instructions to be executed more than once or skipped altogether.
Is there a straightforward solution to this?

This is very hardware specific, but most processors offer a facility that counts the exact number of machine instructions (and other events) that have flowed through them. That's how profilers work to capture things like cache misses: by querying these internal registers.
The PAPI library provides calls to query this data on a variety of major processors. If you're on Linux+x86, PerfSuite gives you some more high-level tools which may be easier to start with.
Intel has a monitor app you can use to watch the chip's internal counters in realtime, and their Performance Analysis Guide describes the various Performance Monitoring Units on the chip and how to read them.
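As a rough illustration of reading such a counter from C on Linux, the sketch below uses the kernel's perf_event_open interface rather than PAPI (that swap is an assumption of this example, as are the helper names count_instructions and sample_work). It counts user-space instructions retired while a function runs, and returns -1 when the counter is unavailable, for example inside a VM or when perf_event_paranoid forbids it.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical workload to measure: a simple summing loop. */
void sample_work(void)
{
    volatile unsigned long sum = 0;
    for (unsigned long i = 0; i < 100000; i++)
        sum += i;
}

/* Hypothetical helper: returns the number of user-space instructions
 * retired while work() runs, or -1 if the counter cannot be used. */
long long count_instructions(void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user space only */
    attr.exclude_hv = 1;

    /* glibc has no wrapper for perf_event_open, so use syscall().
     * pid 0 = this process, cpu -1 = any CPU, no group, no flags. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof count);
    close(fd);
    return n == (ssize_t)sizeof count ? (long long)count : -1;
}
```

Calling count_instructions(sample_work) returns the count for the loop; the exact value depends on the compiler and optimization level.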

If you're on Linux, you should be able to run your program through cachegrind to get instruction counts.
It may also be possible to use ollydbg's Run Trace function to obtain an instruction count, but that may be limited by memory.
Alternately, it is possible to write a small debugger that simply runs the program in single steps.
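A minimal sketch of such a single-stepping "debugger", assuming Linux and ptrace on an architecture that supports PTRACE_SINGLESTEP. The helper name count_single_steps is made up for this example. It is very slow, because every instruction costs a round trip through the kernel.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs the program named by argv[0] under ptrace and
 * returns the number of single steps it took, or -1 on failure. */
long count_single_steps(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then become the target program. */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status;
    long steps = 0;

    /* A traced child stops with SIGTRAP once exec succeeds. */
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    while (WIFSTOPPED(status)) {
        /* Swallow the trace trap itself, forward any other signal. */
        int sig = WSTOPSIG(status) == SIGTRAP ? 0 : WSTOPSIG(status);
        if (ptrace(PTRACE_SINGLESTEP, pid, NULL, (void *)(long)sig) < 0 ||
            waitpid(pid, &status, 0) < 0) {
            kill(pid, SIGKILL);
            waitpid(pid, NULL, 0);
            return -1;
        }
        if (WIFSTOPPED(status))
            steps++; /* one more instruction executed */
    }
    return steps;
}
```

Passing an argv such as { "true", NULL } counts everything the process executes in user space, including the dynamic loader and C library startup, not just main.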

The raw tools for tracking system calls are platform specific.
Solaris: truss or dtrace
MacOS X: dtrace
Linux: strace
HP-UX: tusc
AIX: truss
Windows: ...
For example (Solaris):
truss -o ls.truss ls $HOME
This will capture all the system calls made by ls as it lists your home directory.
OTOH, this may not be what you're after...in which case it is of limited value.
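For a feel of what these tools do underneath, here is a rough sketch of a system call counter, assuming Linux and ptrace. The helper name count_syscalls is hypothetical, and the entry/exit bookkeeping assumes the traced program does not itself call exec.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] and returns how many system calls it
 * entered, or -1 on failure. */
long count_syscalls(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) < 0)
            _exit(127);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFSTOPPED(status))
        return -1;

    /* Mark syscall stops as SIGTRAP|0x80 so they differ from real signals. */
    ptrace(PTRACE_SETOPTIONS, pid, NULL, (void *)(long)PTRACE_O_TRACESYSGOOD);

    long calls = 0;
    int entering = 1; /* syscall stops alternate: entry, exit, entry, ... */
    int sig = 0;      /* drop the initial post-exec SIGTRAP */

    for (;;) {
        if (ptrace(PTRACE_SYSCALL, pid, NULL, (void *)(long)sig) < 0 ||
            waitpid(pid, &status, 0) < 0) {
            kill(pid, SIGKILL);
            waitpid(pid, NULL, 0);
            return -1;
        }
        if (!WIFSTOPPED(status))
            break; /* the program exited or was killed */

        if (WSTOPSIG(status) == (SIGTRAP | 0x80)) {
            if (entering)
                calls++;
            entering = !entering;
            sig = 0;
        } else {
            sig = WSTOPSIG(status); /* forward a real signal */
        }
    }
    return calls;
}
```

This only counts the calls. Decoding names and arguments, as truss and strace do, additionally requires reading the stopped process's registers, which is architecture specific.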

Related

How to use qemu to do profiling on a algorithm

I have a program that now runs well on Ubuntu. It is written purely in C and will eventually run on an embedded processor. I want to know its execution speed on different targets, such as Cortex-M3, M4, or the A series. Because it does a lot of double-precision arithmetic, the difference should be obvious. My current idea is to use qemu to count the instructions executed for some set of data. As the program only does data processing, the only resource it requires should be RAM.
I don't need a very accurate result, as it will only serve as a guide for choosing a CPU. Is there an easy guide for this task? I have little experience with qemu. I saw there are two ways to invoke qemu: qemu-system-arm and qemu-user. I guess the most accurate simulation result would come from qemu-system-arm. Also, the Cortex-M series should not support Linux because it lacks an MMU, right?
There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:
branch mispredictions
cache misses
TLB misses
memory latency
as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.
For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.
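To make the cycle-count point concrete, here is a small sketch that weights an instruction mix by the per-instruction costs quoted above (1 cycle for VMUL.F32, 14 for VDIV.F32 on a Cortex-M4). The helper name estimate_fp_cycles is hypothetical, and the model deliberately ignores everything else (loads, stalls, pipelining).

```c
#include <stdint.h>

/* Cycle costs quoted above for the Cortex-M4. */
enum { VMUL_F32_CYCLES = 1, VDIV_F32_CYCLES = 14 };

/* Hypothetical helper: cycle estimate for a mix of FP multiplies and
 * divides, using only the two costs above. */
uint64_t estimate_fp_cycles(uint64_t n_vmul, uint64_t n_vdiv)
{
    return n_vmul * VMUL_F32_CYCLES + n_vdiv * VDIV_F32_CYCLES;
}
```

For example, 900 multiplies and 100 divides come to 900 + 1400 = 2300 cycles under this model, even though the instruction count is only 1000.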
The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.
On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).
In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.
I'm posting my solution here in case someone else just wants a rough estimate, as I did.
Eclipse Embedded CDT provides a good starting point. You can start with a simple LED blink template. It currently supports only soft FP arithmetic. You can start qemu with the built embedded program, and a picture of the STM32F407 board will appear. The LED in the picture blinks as the program runs.
The key point is that I can use the script from Counting machine instructions using gdb to count instructions on the qemu target.
However, Eclipse Embedded CDT seems to get stuck when some library code is executed. Here is my workaround: start qemu manually (I got the command by running 'ps' while Eclipse was starting qemu):
In the first terminal:
qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c
Then in the second terminal:
arm-none-eabi-gdb blinky_c.elf
and below is the command history I input in the gdb terminal
(gdb) show commands
1 target remote :1235
2 load
3 info register
4 set $sp = 0x20020000
5 info register
6 b main
7 c
Then you can use the gdb to count instruction as in Counting machine instructions using gdb.
One big problem with this method is that it is really slow, because gdb uses stepi to go through all the code being counted before it produces a result. It took me around 3 hours on my Ubuntu VMware machine to count 5.5M executed instructions.
One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c
This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a dc motor simulator) and all reads and writes to it are automatically done over network by the simulator that runs this. The platform definition is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl This is how the structure is mapped.
From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).

Will static linking allow cross platform execution?

I am curious about how statically linked C executables would work in different environments. Let's say we compile our C code to target x86 macOS and statically include everything it uses in the executable as well (print, strlen). What really stops this executable from running on Windows if we include every library it needs? I understand the file format could be different and break, but other than that, would this technically be able to run?
I see where you're coming from: operating systems make us think, as programmers, that libraries are the be-all and end-all of programming, that a call to a library is all you need to make complex things happen, and that everything is contained in them.
But the truth is, libraries mostly provide an abstraction layer. As an example, let's create a library called "hello_world.so" which prints "Hello World!" to the console. That library relies on stdio to handle the complex I/O stuff, but stdio itself depends on at least one other thing: the kernel (some specific targets work without a kernel, but these systems are outside the scope of this answer).
In the desktop world, things can get really complicated, we have several hundreds of processes all running at once even in an idle system, all these apps need access to the hardware (possibly at once too) so it was decided a controller was needed, some piece of software that would coordinate all other software running on the same computer. This piece of software is usually called a kernel. On Windows it's the NT kernel, on macOS it's the XNU and on Linux it's... the Linux kernel!
On these systems, the biggest job of a library is to abstract the kernel, to make us believe printing text on a Linux or a Windows console works the exact same way when actually it can be completely different! Libraries like stdio/time/etc have different "implementations" but the same "interface": they look the same from the dev's point of view, but the way they achieve their goals can vary wildly; they can do conversions, calls to other hidden or non-hidden functions... All of this is completely portable from one OS to another, though; things start to go south for your idea when kernel calls show up.
Kernel calls are the way a program "talks" to the kernel. They can be used to do literally thousands of different things: for example, there's one (or several) to ask for memory (usually reached through malloc), one to print to the console, one to ask if a network packet has arrived, one to ask to talk to your GPU... And these system calls are completely different from one kernel to another, sometimes even between two versions of the same kernel!
These "kernel calls" are the only thing preventing you from running statically-compiled linux programs on Windows.
PS: Even though all of the above is true and kernels can be as different from one another as they wish, due to the history of kernels and of computing in general, some kernels actually share much the same interface (even though their implementations, as you guessed, can be nothing alike). The best example I know of is how most kernels I know of are modeled on the UNIX kernel.
This means that, even though I have never tested it myself, I think you should be able to statically link a Linux app and use it on Linux, most BSDs, and possibly even macOS.
The binary and libraries are specific to the operating system.
The TL;DR is that the linking process translates function calls into addresses that point into the operating system's specific libraries. There are some differences, like alignment, that are decided at compile time, but what is responsible for your x86 instructions not running under a different OS is the linker.
Your compiler produces x86 instructions that are ready to execute, but the result is incomplete. The linker goes through every function call and gives an address for that function in the executable file, even for functions of the standard library.
The linker creates the executable file following a file format that has a header with information like size, metadata, and entry point. Windows and Unix have different specifications for executable files: Windows uses the PE format, while Linux and most other Unix-like systems use the ELF format for both executables and libraries (macOS uses Mach-O).
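The difference in file formats is visible in the very first bytes of a file. As a minimal sketch (the function name detect_executable_format is made up for this example, and Mach-O, the macOS format, is added here although the answer above does not discuss it), the format can be guessed from the leading magic bytes:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guesses an executable's format from its first bytes.
 * Returns "ELF", "PE", "Mach-O" or "unknown". */
const char *detect_executable_format(const unsigned char *buf, size_t len)
{
    static const unsigned char elf[] = { 0x7f, 'E', 'L', 'F' };
    /* 64-bit little-endian Mach-O magic 0xfeedfacf as stored on disk. */
    static const unsigned char macho64[] = { 0xcf, 0xfa, 0xed, 0xfe };

    if (len >= 4 && memcmp(buf, elf, 4) == 0)
        return "ELF";
    if (len >= 4 && memcmp(buf, macho64, 4) == 0)
        return "Mach-O";
    /* PE files begin with a DOS "MZ" stub; a stricter check would follow
     * the header's offset field to the "PE\0\0" signature. */
    if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return "PE";
    return "unknown";
}
```

Reading the first four bytes of a file with fread and passing them in is enough for this check. A loader for one format simply rejects the others.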
Through some hacks and non-trivial tricks it is possible to create an executable that can run on Windows and Unix (see αcτµαlly pδrταblε εxεcµταblε for how).
But even if you do all that, there is an obstacle that cannot be circumvented: the kernel. The kernel is, well, a kernel. It's the most important thing in an OS, and it provides a set of API calls that give basic, low-level access to computer resources. Functions like malloc are therefore implemented on top of kernel-specific API calls: VirtualAlloc on Windows, and brk or mmap on Unix-like systems.
Main Answer
If your program does anything useful (print output, return a value, communicate on the network), it contains some form of system-call instruction. Each system-call instruction is a request to a particular operating system, and macOS system calls will not work on Windows and vice-versa.
The system-call instruction sends information to the operating system, including a number identifying which service is requested. The operating system then performs that service. When you build your program for macOS, it includes library routines that contain system-call instructions. If you execute those instructions on a Microsoft Windows system, Windows will not understand the macOS requests. It will interpret the information differently, and the program will not work.
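To show what such a request looks like when the library wrapper is peeled away, here is a small sketch for Linux (chosen only because its syscall() function makes this easy; the answer above talks about macOS and Windows). The helper name raw_write is hypothetical.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes to fd without calling write(). */
ssize_t raw_write(int fd, const void *buf, size_t len)
{
    /* SYS_write is the number Linux assigns to the "write" service on this
     * architecture. Another kernel may give that number a different
     * meaning, or none at all. */
    return (ssize_t)syscall(SYS_write, fd, buf, len);
}
```

A call such as raw_write(1, "hi\n", 3) prints to standard output on Linux. A statically linked binary carries this number inside it, which is why another operating system cannot make sense of the request.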
So, in theory, there is nothing preventing you from writing your own program loader that reads a Mach-O executable file intended for macOS, loads its contents into memory, and transfers control to its entry point. But the program will not work because of the system calls.
Supplement
You might consider translating all the system calls in the program. Changing the primary number that identifies the service request might be feasible; it might not require changing the executable too much. For example, if 37 is a “write to file” request on macOS, your program loader might change it to 48 on Windows. However, the system calls also require other data be passed, such as pointers to buffers, lengths, and so on, and there are likely many discrepancies in how those are passed, so that macOS requests cannot be easily translated into Windows requests. Also, it can be technically challenging to identify all places in a program that a certain instruction is used—some of the contents of memory of a loaded program are instructions and some are data. Most normal programs may be well-behaved and easy to analyze in this regard, but not all are.
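The number-remapping part of that idea could be sketched as a lookup table. The numbers 37 and 48 are the made-up values from the example above, not real macOS or Windows system call numbers, and the names here are hypothetical. Note that this covers only the service number; it does nothing about the differing arguments.

```c
#include <stddef.h>

struct syscall_map {
    int from; /* number used by the original program */
    int to;   /* number expected by the host system */
};

/* Illustrative values only, taken from the example in the text. */
static const struct syscall_map syscall_table[] = {
    { 37, 48 }, /* "write to file" */
};

/* Hypothetical helper: returns the host's number for a request, or -1 if
 * the request has no known equivalent. */
int translate_syscall_number(int original)
{
    for (size_t i = 0; i < sizeof syscall_table / sizeof syscall_table[0]; i++)
        if (syscall_table[i].from == original)
            return syscall_table[i].to;
    return -1;
}
```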
Another potential issue is that programs may expect to have certain modes set in the processor, and the host operating system may or may not have set those modes as needed.

How to obtain PMU events when running ARM bigLITTLE inside gem5

I'm running an ARM full system simulation in gem5 and the configurations I'm using in the commandline is:
./build/ARM/gem5.perf configs/example/arm/fs_bigLITTLE.py
--kernel=/home/ting-bazinga/gem5/linux-arm-gem5/vmlinux
--caches
--disk /home/ting-bazinga/gem5/fs_imgs/disks/aarch64-ubuntu-trusty-headless.img
--bootscript /home/ting-bazinga/gem5/fs_imgs/test.rcS
From the post Using perf_event with the ARM PMU inside gem5, I presume obtaining PMU events in gem5 is possible. However, I haven't found the exact method for doing that.
perf can be used to obtain PMU information; on my local machine I can just install linux-tools-common from my terminal to get that tool. But I can't do the same in the simulation. There isn't a perf binary that I can just find online (or maybe someone can give a hint on how to build this kind of binary?). I also tried downloading the linux-tools-common package, copying it into the disk image, and then using the makefile to compile it, but somehow the makefile does not work in the simulated system.
Or can the PMU events be obtained using C code? In the post I mentioned above, someone used C code to count the number of branches mispredicted by the branch predictor unit during a specific task. And I can use perf_event_open to obtain the number of instructions during an execution. However, running the perf_event_open code requires root, and I cannot use sudo in the simulated system.
Can anybody give me some instructions on how to obtain PMU events in gem5? Many thanks.

Create a Debugger using C

I have been asked to write a program in C that debugs another C program and stores the value of each variable at every line, loop, or function in a log file.
I have been searching the internet and found articles on debugging using gdb.
Can I somehow use GDB in my program for this purpose and then store the values of each variable line by line?
I've got basic knowledge of C/C++ so please reply in simple terms.
Thanks
Debuggers depend on some special capability of the hardware, which must be exposed by the operating system (if any).
The basic idea is that the hardware is configured to transfer control to a debugger stub either after every instruction of the target program, or after certain types of instructions such as system calls, or those meeting a hardware breakpoint condition. Typically this would look like an interrupt, supervisor exception, or the like - very platform-specific detail.
As mentioned in comments, on Linux, you use the ptrace functionality of the kernel to interact with the debugger support provided by the hardware and the kernel, abstracting away a lot of the hardware-unique detail and managing the permission issues. Typically you must either be the same user id as the process being debugged, or be the superuser (root). Linux's ptrace also gives you an indirect ability to do things like access the memory (literally, address space) of the target application, something critical to debugger functionality which you cannot ordinarily do from another user-mode program on a multitasking operating system.
Other operating systems will have different methods. Some embedded targets use debug pods which connect your development machine to the embedded board by a few wires. In other cases, debug capability built into the hardware is managed by a small program running on the target processor, which then talks back over a serial or network port to the full debugger program residing on the development machine.
A program such as GDB can do more than just the basics of setting debug stop conditions, dumping registers, and dumping program instructions. Much of its code deals with annotating what it displays based on debug metadata optionally left behind by compilers, walking back through stack frames, and giving the user powerful tools to configure all of this - and of course it does most of this in a target-independent way, with the target-unique code mostly confined to a few interchangeable directories.
You can indeed "drive" GDB from another program - many, many GUI type debuggers do exactly that, existing as graphical front ends for GDB. However, if you were assigned to write a debugger, doing it that way may or may not be consistent with your assignment.

How to profile thread load balancing?

I need to see the load balancing characteristics of my multithreaded program. Is there any tool that will give me the information to, e.g. plot this? I need something simple that will give me information per core, for example, but not Intel VTune and the such... that is so bloated it hurts to even look at it.
Take a look at Linux Trace Toolkit - next generation (LTTng). You can also use GNU gprof; it's not sexy, but it does the job :)
EDIT :
You can use gprof in threaded environment : Using gprof with pthreads
EDIT2 : Oprofile may help also
I've only scratched the surface of the capabilities of AMD's CodeAnalyst, but what I have found so far is impressive, especially all the performance counters and getting them into the detailed picture. As to per-thread profiling, I mostly write massively parallel applications running for extended periods of time on dedicated cores, which may not be applicable to your stuff.
It appears quite stingy with respect to its own CPU needs. I don't know if it will profile on intel CPUs. There is a Linux version.
Give it a spin!
You can also use perf, the official tool for working with performance counters in the Linux kernel. In addition to reading performance counters, it also gives access to some other metrics such as context switches, CPU migrations, page faults, etc.
Unfortunately the official wiki does not contain too much information. But you can check this page for more information on how to use the different tools included in perf.
To research this subject, I used the following command:
ps -AL -o lwp,fname,psr | grep ammp
The application under study was ammp, which uses as many threads as there are cores. The command returns which core each thread was on. Executing this command several times, you will see how a given thread moves across the cores and how the load balancing algorithm works.
I hope you find this useful.
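The same per-thread information can also be sampled from inside the program. The sketch below assumes Linux and uses the getcpu system call; the helper names current_cpu and count_cpu_migrations are made up for this example. It reports how often the calling thread changed cores between samples.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns the CPU the calling thread is running on right now, or -1. */
int current_cpu(void)
{
    unsigned cpu = 0;
    if (syscall(SYS_getcpu, &cpu, NULL, NULL) != 0)
        return -1;
    return (int)cpu;
}

/* Hypothetical helper: samples the calling thread's CPU `samples` times,
 * pausing pause_us microseconds between samples, and returns how many
 * times the CPU changed from one sample to the next (-1 on error). */
int count_cpu_migrations(int samples, unsigned pause_us)
{
    int last = current_cpu();
    int moves = 0;

    if (last < 0)
        return -1;
    for (int i = 1; i < samples; i++) {
        usleep(pause_us);
        int now = current_cpu();
        if (now < 0)
            return -1;
        if (now != last)
            moves++;
        last = now;
    }
    return moves;
}
```

Each worker thread can call count_cpu_migrations itself and log the result. Like the repeated ps runs, this is sampling, so moves that happen between two samples go unnoticed.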

Resources