What is the best way to profile and optimize a clutter-box2d application on an ARM target?
I have tried using Valgrind to profile the code on x86 before porting, but it doesn't seem to help: the ported application still runs considerably slower on the ARM target.
I wasn't able to get Valgrind working properly on the ARM target to profile and identify bottlenecks.
I used OProfile a bit, but it gives a system-wide snapshot and doesn't do much good, since it does not produce call graphs.
If all else fails (and you are on a glibc-based system), you can go the traditional route and use gprof to collect profiling data.
http://en.wikipedia.org/wiki/Gprof
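Roughly, that means rebuilding with -pg, running the program to completion on the target so it writes gmon.out, and then post-processing the result. A minimal sketch, assuming a Linux/glibc ARM toolchain (toolchain and file names are illustrative):
arm-linux-gnueabihf-gcc -pg -O2 -o myapp myapp.c
./myapp
gprof myapp gmon.out > profile.txt
Note that gmon.out is only written if the program exits normally.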
I have a program that runs well on Ubuntu now. The program is written purely in C, and it will eventually run on an embedded processor. I would like to know its execution speed on different targets, like Cortex-M3, M4 or A-series parts. As there is quite a lot of double-precision arithmetic, the difference should be obvious. Currently, my idea is to use QEMU to count the instructions executed for some set of data. As the program only does data processing, the only required resource should be RAM.
I don't need a very accurate result, as it will only serve as a guide for choosing a CPU. Is there some easy guide for the task? I have little experience with QEMU. I saw there are two ways to invoke QEMU: qemu-system-arm and qemu-user. I guess the most accurate simulation result would come from qemu-system-arm. Also, the Cortex-M series shouldn't support Linux due to the lack of an MMU, right?
There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:
branch mispredictions
cache misses
TLB misses
memory latency
as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.
For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.
The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.
On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).
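For what it's worth, recent QEMU releases ship a sample "insn" plugin (built from contrib/plugins in the QEMU source tree) that reports a raw instruction count when the guest exits; the invocation is roughly along these lines, though check the documentation for your QEMU version:
qemu-arm -plugin ./contrib/plugins/libinsn.so -d plugin ./myprog
qemu-system-arm -M virt -kernel myimage.elf -plugin ./contrib/plugins/libinsn.so -d plugin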
In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.
I post my solution here, in case someone just wants a rough estimate like me.
Eclipse Embedded CDT provides a good starting point. You can start with a simple LED blink template. It supports soft-FP arithmetic only for now. You can start QEMU with the built embedded program, and a picture of the STM32F407 board will appear. The LED in the picture will blink as the program runs.
The key point is that I can use the script from Counting machine instructions using gdb to count instructions on the QEMU target.
However, it seems Eclipse Embedded CDT will get stuck when some library code is executed. Here is my workaround: start QEMU manually (the command was obtained via 'ps' while Eclipse was running QEMU):
In the first terminal:
qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c
Then in the second terminal:
arm-none-eabi-gdb blinky_c.elf
and below is the command history I entered in the gdb terminal:
(gdb) show commands
1 target remote :1235
2 load
3 info register
4 set $sp = 0x20020000
5 info register
6 b main
7 c
Then you can use gdb to count instructions as described in Counting machine instructions using gdb.
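The counting itself boils down to single-stepping in a loop and incrementing a convenience variable, something like the sketch below (0x08001234 stands in for whatever address you want to stop counting at):
(gdb) set pagination off
(gdb) set $insn_count = 0
(gdb) while ($pc != 0x08001234)
 >stepi
 >set $insn_count = $insn_count + 1
 >end
(gdb) print $insn_count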
One big problem with this method is that it is really slow, as gdb has to stepi through all the code to be counted before producing a result. It took me around 3 hours in my Ubuntu VMware machine to count 5.5M executed instructions.
One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c
This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a DC motor simulator), and all reads and writes to it are automatically done over the network by the simulator that runs it. The platform definition is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl This is how the structure is mapped.
From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).
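For illustration, the firmware-side view of such a mapping is usually just a struct of volatile registers placed at a fixed address. The names and base address below are made up; the real layout is given by the custom_board.repl platform definition:
#include <stdint.h>

/* Hypothetical register block for the simulated DC motor peripheral.
 * The real layout and base address come from the .repl file;
 * these values are placeholders for illustration only. */
struct vdev_regs {
    volatile uint32_t control;    /* written by the firmware                   */
    volatile uint32_t setpoint;   /* desired motor speed                       */
    volatile uint32_t measured;   /* updated by the simulator over the network */
};

#define VDEV_BASE 0x40010000u     /* placeholder address */
#define vdev ((struct vdev_regs *)VDEV_BASE)

/* Every access below becomes a bus transaction the simulator can observe
 * and log, which is what makes the memory profiling possible. */
static inline void set_speed(uint32_t rpm) { vdev->setpoint = rpm; }
static inline uint32_t read_speed(void)    { return vdev->measured; }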
I am trying to use a profiler on the DS-5 Simulator. I don't want to attach any boards at this time, and hence I believe I cannot use the Streamline Analyzer.
My question is how can I see code coverage and cycle count usage on DS-5 Simulator (Cortex A8) on Windows in Eclipse environment.
Thanks
The simulators that ARM provides are called Fast Models (or fixed virtual platforms, depending on the degree of flexibility you want). They are not designed to be cycle accurate, but are instead instruction accurate. Basically, they give you a "programmer's view" of a system.
For this reason, running Streamline on a model in order to optimize an application or system wouldn't give you a realistic performance profile.
Hope this helps!
I'm a high-school student doing some C things where I'd like to profile my code to see where the actual performance bottlenecks are. I don't have much money, so I'd prefer free tools.
I like to use the MinGW/GCC compiler toolchain. This is not something I'm stuck with, but I'd prefer tools that are capable of working with this.
Features I need:
See how much total time is spent in a certain function.
Features I'd like:
See how much time a line of code takes.
Cross-platform (being able to use the same software on Linux & Mac)
See how often a function gets called (and how long each call takes on average).
See what causes the time spent (cache misses, branch mispredictions, etc).
I've tried using gprof, but I couldn't get it to work (it only shows main in the profile), and I've heard bad things about it, so what are my options?
If you want a free, Windows and Linux TBP (time-based profiler; it also does event-based and some other metric-based forms of profiling), then AMD's CodeAnalyst should do the job nicely (even on Intel CPUs, though I'm not sure of the quality/reliability of the branching and cache analysis on Intel CPUs). It also has a nice UI built in Qt which does the source + assembly line time breakdowns, and an API to embed events for the profiler to catch for more targeted profiling.
I am wondering how you profile software on bare-metal systems (ARM Cortex-A8)? Previously I was using a simulator which had built-in benchmark statistics, and now I want to compare results from real hardware (running on a BeagleBoard-xM).
I understand that you can use gprof; however, I'm kind of lost, as that assumes you have to run Linux on the target system?
I build the executable file with Codesourcery's arm-none-eabi cross-compiler and the target system is running FreeRTOS.
Closely evaluate what you mean by "profiling". You are indeed operating very close to bare metal, and it's likely that you will be required to take on some of the work performed by a tool like gprof.
Do you want to time a function call? or an ISR? How about toggling a GPIO line upon entering and exiting the code under inspection. A data logger or oscilloscope can be set to trigger on these events. (In my experience, a data logger is more convenient since mine could be configured to capture a sequence of these events - allowing me to compute average timings.)
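A minimal sketch of the GPIO-toggle idea (the register definitions are placeholders; substitute your SoC's real GPIO set/clear registers, e.g. from the DM37xx TRM for a BeagleBoard-xM):
#include <stdint.h>

/* Placeholder addresses -- look up the real GPIO set/clear data-out
 * registers for your part; these are for illustration only. */
#define GPIO_SETDATAOUT  (*(volatile uint32_t *)0x40000094u)  /* hypothetical */
#define GPIO_CLRDATAOUT  (*(volatile uint32_t *)0x40000090u)  /* hypothetical */
#define PROBE_PIN        (1u << 5)                            /* hypothetical */

void code_under_test(void);

void timed_run(void)
{
    GPIO_SETDATAOUT = PROBE_PIN;   /* scope/logger trigger: enter */
    code_under_test();
    GPIO_CLRDATAOUT = PROBE_PIN;   /* scope/logger trigger: exit  */
}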
Do you want to count the number of executions? The Cortex A8 comes equipped with a number of features (like configurable event counters) that can assist: link. Your ARM chip may be equipped with other peripherals that could be used, as well (depending on the vendor). Regardless, take a look at the above link - the new ARMs have lots of cool features that I don't get to play with as much as I would like! ;-)
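For example, on a Cortex-A8 the ARMv7 PMU cycle counter can be enabled and read from privileged code with a few CP15 accesses. A rough sketch (verify the encodings against the ARMv7-A Architecture Reference Manual and your chip's TRM):
#include <stdint.h>

/* Enable the PMU and its cycle counter (must run in a privileged mode). */
static inline void pmu_cycle_counter_enable(void)
{
    uint32_t v;
    __asm volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(v));        /* read PMCR            */
    v |= (1u << 0) | (1u << 2);                                     /* E bit + cycle reset  */
    __asm volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(v));         /* write PMCR           */
    __asm volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));  /* PMCNTENSET: PMCCNTR  */
}

/* Read the current cycle count (PMCCNTR). */
static inline uint32_t pmu_cycle_counter_read(void)
{
    uint32_t cycles;
    __asm volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles) : : "memory");
    return cycles;
}
Wrap the code under test with two reads and subtract to get an approximate cycle count; note the 32-bit counter wraps fairly quickly at Cortex-A8 clock rates.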
I have managed to get profiling working for ARM Cortex M. As the GNU ARM Embedded (launchpad) tools do not come with profiling libraries included, I have added the necessary glue and profiling functionality.
References:
See http://mcuoneclipse.com/2015/08/23/tutorial-using-gnu-profiling-gprof-with-arm-cortex-m/
I hope this helps.
I want to run a simple hello world app, written in C, on my AT91SAM9RL-EK.
Is it possible without an OS?
And (if it is) how do I have to compile it?
Right now I am trying to use G++ Lite for creating ARM code.
(In general, which programs can the board start without an OS: assembler, ARM code?)
Sure, no problem running without an operating system, I do that kind of thing daily...
http://sam7stuff.blogspot.com/
Your programs are, at least at first, not going to resemble desktop applications. I would avoid any C libraries; no printfs or strcmps or things like that until you get the feel for it and find the right tools. No floating point as well. Add some numbers, do some shifting, blink some LEDs.
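To give a feel for it, a bare-metal program at this level is essentially just writes to peripheral registers. A minimal LED-blink sketch (the PIO base address, register offsets, and LED pin are placeholders; take the real values from the AT91SAM9RL datasheet and board schematic, and you still need a small startup file and linker script):
#include <stdint.h>

/* Placeholder values -- substitute the real PIO controller base address
 * and LED pin for the AT91SAM9RL-EK. */
#define PIO_BASE 0xFFFFF400u                                /* hypothetical  */
#define PIO_PER  (*(volatile uint32_t *)(PIO_BASE + 0x00))  /* PIO enable    */
#define PIO_OER  (*(volatile uint32_t *)(PIO_BASE + 0x10))  /* output enable */
#define PIO_SODR (*(volatile uint32_t *)(PIO_BASE + 0x30))  /* set output    */
#define PIO_CODR (*(volatile uint32_t *)(PIO_BASE + 0x34))  /* clear output  */
#define LED_PIN  (1u << 8)                                  /* hypothetical  */

static void delay(volatile uint32_t n) { while (n--) { } }

int main(void)
{
    PIO_PER = LED_PIN;          /* give the pin to the PIO controller */
    PIO_OER = LED_PIN;          /* configure it as an output          */
    for (;;) {
        PIO_SODR = LED_PIN;     /* LED on  */
        delay(500000);
        PIO_CODR = LED_PIN;     /* LED off */
        delay(500000);
    }
}
Built with something like arm-none-eabi-gcc -nostdlib plus a startup file and linker script, then converted with objcopy -O binary into an image the board's boot process can load.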
CodeSourcery Lite is probably the fastest way to get started; the GNU EABI one, I believe, is the one you want.
This WinARM site has a compiler and tons of non-OS projects for what seems like every ARM-based microcontroller.
http://www.siwawi.arubi.uni-kl.de/avr_projects/arm_projects/
Atmel is very very good about information, no doubt they have example programs you can try as well on the eval board.
Emdebian is another cross-compiler that is somewhat up to date and has binaries. Building a gcc from scratch for cross-compiling is not bad at all. The C library is another story though, and even the gcc library for that matter. I find it easier to do without either library.
It is possible to get a C library working and run a great many kinds of programs; it depends on what you are looking to do. Ahh, just looked at the specs: that is a pretty serious eval board, with plenty of power for an operating system should you choose to run one. You can certainly run programs that use the display as a user interface, read/write SD cards, USB, basically everything on the board, without an OS, if you choose.