Cycle Count Profiling on ARM DS-5 Simulator

I am trying to use a profiler on the DS-5 Simulator. I don't want to attach any boards at this time, and hence I believe I cannot use the Streamline Analyzer.
My question is: how can I see code coverage and cycle count usage on the DS-5 Simulator (Cortex-A8) on Windows, in the Eclipse environment?
Thanks

The simulators that ARM provides are called Fast Models (or Fixed Virtual Platforms, depending on the degree of flexibility you want). They are not designed to be cycle accurate, but are instead instruction accurate. Basically, they give you a "programmer's view" of a system.
For this reason, running Streamline on a model in order to optimize an application or system wouldn't give you a realistic performance profile.
Hope this helps!

Related

How to use qemu to do profiling on an algorithm

I have a program that runs well on Ubuntu now. The program is written purely in C, and it will eventually run on an embedded processor. I would like to know its execution speed on different targets, such as Cortex-M3, Cortex-M4, or the Cortex-A series. As there is quite a lot of double-precision arithmetic, the difference should be obvious. Currently, my idea is to use qemu to count the instructions executed for some set of data. As the program only does data processing, the only required resource should be RAM.
I don't need a very accurate result, as it will only serve as a guide for choosing a CPU. Is there some easy guide for the task? I have little experience with qemu. I saw there are two ways to invoke qemu: qemu-system-arm and qemu-user. I guess the most accurate simulation result would be obtained with qemu-system-arm. What's more, the Cortex-M series should not support Linux due to the lack of an MMU, right?
There's not a lot out there on how to do this because it is in general pretty difficult to do profiling of guest code on an emulated CPU/system and get from that useful information about performance on real hardware. This is because performance on real hardware is typically strongly dependent on events which most emulation (and in particular QEMU) does not model, such as:
branch mispredictions
cache misses
TLB misses
memory latency
as well as (usually less significantly than the above) differences in number of cycles between instructions -- for instance on the Cortex-M4 VMUL.F32 is 1 cycle but VDIV.F32 is 14.
For a Cortex-M CPU the hardware is simple enough (ie no cache, no MMU) that a simple instruction count may not be too far out from real-world performance, but for an A-class core instruction count alone is likely to be highly misleading.
The other approach people sometimes want to take is to measure run-time under a model; this can be even worse than counting instructions, because some things that are very fast on real hardware are very slow in an emulator (eg floating point instructions), and because the JIT process introduces extra overhead at unpredictable times.
On top of the conceptual difficulties, QEMU is not currently a very helpful environment for obtaining information like instruction counts. You can probably do something with the TCG plugin API (if you're lucky one of the example plugins may be sufficient).
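As a rough sketch of what that looks like (the flag syntax and the plugin's location vary with the QEMU version and build tree; libinsn.so is one of the bundled example plugins):
qemu-arm -plugin ./tests/plugin/libinsn.so -d plugin ./myprog
The insn example plugin counts executed instructions and reports the total through the plugin log when the guest exits.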
In summary, if you want to know the performance of a piece of code on specific hardware, the easiest and most accurate approach is to run and profile the code on the real hardware.
I post my solution here, in case someone just wants a rough estimate like me.
Eclipse Embedded CDT provides a good starting point. You can start with a simple LED blink template. It supports soft-FP arithmetic only at the moment. You can start qemu with the built embedded program, and a picture of the STM32F407 board will appear. The LED in the picture will blink as the program runs.
The key point is that I can use the script from Counting machine instructions using gdb to count instructions on the qemu target.
However, it seems Eclipse Embedded CDT will get stuck when some library code is executed. Here is my workaround: start qemu manually (the command was obtained with 'ps' while Eclipse was running qemu):
In the first terminal:
qemu-system-gnuarmeclipse --verbose --verbose --board STM32F4-Discovery --mcu STM32F407VG --gdb tcp::1235 -d unimp,guest_errors --semihosting-config enable=on,target=native --semihosting-cmdline blinky_c
Then in the second terminal:
arm-none-eabi-gdb blinky_c.elf
and below is the command history I entered in the gdb terminal:
(gdb) show commands
1 target remote :1235
2 load
3 info register
4 set $sp = 0x20020000
5 info register
6 b main
7 c
Then you can use gdb to count instructions as in Counting machine instructions using gdb; the core of that script is sketched below.
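For reference, the counting loop boils down to a gdb script along these lines (a sketch: the stop address 0x08001234 is a placeholder -- take the real one from your map file or from a breakpoint):
set pagination off
set $count = 0
while $pc != 0x08001234
  stepi
  set $count = $count + 1
end
print $count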
One big problem with this method is that it is really slow, as gdb has to stepi through all the code being counted before producing a result. It took me around 3 hours in my Ubuntu VMware machine to count 5.5M executed instructions.
One thing that you can do is use a simulation setup like the one used in this sample: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/src/main.c
This may look like an ordinary embedded application, but the data structure vdev actually resides in a different application running on the computer (in this case a DC motor simulator), and all reads and writes to it are automatically done over the network by the simulator that runs this. The platform definition is here: https://github.com/swedishembedded/sdk/blob/main/samples/lib/control/dcmotor/boards/custom_board.repl This is how the structure is mapped.
From here it is not hard to implement advanced memory profiling by directly capturing reads and writes from the simulated application (which in this case is compiled for STM32 ARM).

Is it possible to compile and run the dlib library on embedded devices with ARM Cortex-M7 processors?

I have just started using the amazing dlib library in Visual Studio, and I have been able to compile and run the face detection examples. I was wondering if it would be possible to compile and run the library on an Mbed device, such as this one, with an M7 (or another M-series) processor. In other words, what specifications should I look out for to determine whether a microcontroller can, if at all, run dlib? Note that Mbed devices run C++ code, so it would be possible to copy and paste the source code of dlib and compile it, but I want to know if this is feasible before I purchase a board. Also, if the RAM and ROM of the board are not enough, I can always attach external RAM/ROM.
Alternatively, if anyone knows of a library that can perform face detection or recognition on an embedded device, I would be happy to hear it.
Thanks.
Although the F769 is a considerably powerful embedded device, there is no chance that dlib will run on it. Machine learning algorithms, even if not run in real time, typically require a vast amount of RAM, especially for online learning (learning on the target). You can take a look at ARM's very own CMSIS-NN library to see what's currently "state of the art" for devices that size.
Take a look at TensorFlow Lite for Microcontrollers. You can put these models on embedded devices. Wake-word detection and object detection run easily on various boards (Arduino Nano 33, SparkFun Edge). There's Mbed support included.
Microcontrollers are not suitable for video and image recognition, even if you attach external RAM. The chip you were suggesting is top of the line in the microcontroller world, but that still means only 2 MB of flash for ALL your software and only 512 KB of RAM on board. Think of it this way: an image with enough detail to recognize someone would be at least a few MB.
I would suggest that you look at ARM's application processors (A series) or the NVIDIA Jetson.

Profiling a clutter-box2d application on ARMv7

What is the best way to profile and optimize a clutter-box2d application on an ARM target?
I have tried using valgrind to profile the code on x86 before porting, but it doesn't seem to help. The ported application still runs considerably slower on the ARM target.
I wasn't able to get valgrind working properly on the ARM target to profile and identify bottlenecks.
I used a bit of OProfile, but it gives a system-wide snapshot and doesn't do much good, since it does not produce call graphs.
If all else fails (and you are on a glibc-based system), you can go the traditional route and use gprof to collect profiling data.
http://en.wikipedia.org/wiki/Gprof
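The usual gprof workflow looks like this (a minimal sketch; file names are placeholders):
gcc -pg -o myapp main.c          # -pg adds the profiling instrumentation
./myapp                          # run normally; writes gmon.out on exit
gprof ./myapp gmon.out > report.txt
Note that gmon.out is only produced when the program exits normally through the C runtime.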

Profiling on bare-metal embedded systems (ARM)

I am wondering how you profile software on bare-metal systems (ARM Cortex-A8)? Previously I was using a simulator which had built-in benchmark statistics, and now I want to compare results from real hardware (running on a BeagleBoard-xM).
I understand that you can use gprof; however, I'm kind of lost, as that seems to assume you are running Linux on the target system.
I build the executable with CodeSourcery's arm-none-eabi cross-compiler, and the target system runs FreeRTOS.
Closely evaluate what you mean by "profiling". You are indeed operating very close to bare metal, and it's likely that you will be required to take on some of the work performed by a tool like gprof.
Do you want to time a function call, or an ISR? How about toggling a GPIO line upon entering and exiting the code under inspection (sketched below)? A data logger or oscilloscope can be set to trigger on these events. (In my experience, a data logger is more convenient, since mine could be configured to capture a sequence of these events, allowing me to compute average timings.)
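A minimal sketch of the idea in C, assuming OMAP3-style set/clear GPIO registers; the addresses and pin below are assumptions to be replaced with your board's:
#include <stdint.h>
/* Assumed addresses (OMAP3 GPIO5 bank) -- check your SoC's reference manual. */
#define GPIO_SETDATAOUT   (*(volatile uint32_t *)0x49056094u)
#define GPIO_CLEARDATAOUT (*(volatile uint32_t *)0x49056090u)
#define PROFILE_PIN       (1u << 10)   /* assumed spare pin wired to the scope */
void function_under_test(void)
{
    GPIO_SETDATAOUT = PROFILE_PIN;    /* rising edge marks entry */
    /* ... code being measured ... */
    GPIO_CLEARDATAOUT = PROFILE_PIN;  /* falling edge marks exit */
}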
Do you want to count the number of executions? The Cortex-A8 comes equipped with a number of features (like configurable event counters) that can assist: link. Your ARM chip may be equipped with other peripherals that could be used as well (depending on the vendor). Regardless, take a look at the above link - the newer ARMs have lots of cool features that I don't get to play with as much as I would like! ;-)
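On the Cortex-A8 specifically, the PMU cycle counter can be read from CP15. Here is a sketch, assuming you run at a privilege level that is allowed to touch the PMU and that nothing else owns the counters:
#include <stdint.h>
/* Enable the PMU and reset the cycle counter (PMCR, then PMCNTENSET). */
static inline void ccnt_enable(void)
{
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= 0x5u;  /* bit 0: enable counters; bit 2: reset cycle counter */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" :: "r"(pmcr));
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" :: "r"(1u << 31));  /* enable CCNT */
}
/* Read the current cycle count (PMCCNTR). */
static inline uint32_t ccnt_read(void)
{
    uint32_t cycles;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}
Subtracting two ccnt_read() values around the code under test gives a cycle count -- which also answers the original DS-5 question once you move from the simulator to real hardware.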
I have managed to get profiling working for ARM Cortex-M. As the GNU ARM Embedded (launchpad) tools do not come with profiling libraries included, I have added the necessary glue and profiling functionality.
References:
See http://mcuoneclipse.com/2015/08/23/tutorial-using-gnu-profiling-gprof-with-arm-cortex-m/
I hope this helps.

How do you profile your code?

I hope not everyone is using Rational Purify.
So what do you do when you want to measure:
time taken by a function
peak memory usage
code coverage
At the moment, we do it manually (using log statements with timestamps and another script to parse the log and output to Excel... phew).
What would you recommend? Pointers to tools or any techniques would be appreciated!
EDIT: Sorry, I didn't specify the environment first. It's plain C on a proprietary mobile platform.
I've done this a lot. If you have an IDE, or an ICE, there is a technique that takes some manual effort, but works without fail.
Warning: modern programmers hate this, and I'm going to get downvoted. They love their tools. But it really works, and you don't always have the nice tools.
I assume in your case the code is something like DSP or video that runs on a timer and has to be fast. Suppose what you run on each timer tick is subroutine A. Write some test code to run subroutine A in a simple loop, say 1000 times, or long enough to make you wait at least several seconds.
While it's running, randomly halt it with a pause key and sample the call stack (not just the program counter) and record it. (That's the manual part.) Do this some number of times, like 10. Once is not enough.
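With gdb, for example, one sample looks like this (a sketch; any debugger that can interrupt the program and print a backtrace will do):
(gdb) run
^C
(gdb) bt
(gdb) continue
Record the bt output each time; those stacks are the samples discussed below.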
Now look for commonalities between the stack samples. Look for any instruction or call instruction that appears on at least 2 samples. There will be many of these, but some of them will be in code that you could optimize.
Do so, and you will get a nice speedup, guaranteed. The 1000 iterations will take less time.
The reason you don't need a lot of samples is you're not looking for small things. Like if you see a particular call instruction on 5 out of 10 samples, it is responsible for roughly 50% of the total execution time. More samples would tell you more precisely what the percentage is, if you really want to know. If you're like me, all you want to know is where it is, so you can fix it, and move on to the next one.
Do this until you can't find anything more to optimize, and you will be at or near your top speed.
You probably want different tools for performance profiling and code coverage.
For profiling I prefer Shark on MacOSX. It is free from Apple and very good. If your app is vanilla C you should be able to use it, if you can get hold of a Mac.
For profiling on Windows you can use LTProf. Cheap, but not great:
http://successfulsoftware.net/2007/12/18/optimising-your-application/
(I think Microsoft are really shooting themselves in the foot by not providing a decent profiler with the cheaper versions of Visual Studio.)
For coverage I prefer Coverage Validator on Windows:
http://successfulsoftware.net/2008/03/10/coverage-validator/
It updates the coverage in real time.
For complex applications I am a great fan of Intel's VTune. It is a slightly different mindset from a traditional profiler that instruments the code. It works by sampling the processor to see where the instruction pointer is 1,000 times a second. It has the huge advantage of not requiring any changes to your binaries, which as often as not would change the timing of what you are trying to measure.
Unfortunately it is no good for .NET or Java, since there isn't a way for VTune to map the instruction pointer to a symbol as there is with traditionally compiled code.
It also allows you to measure all sorts of other processor/hardware-centric metrics, like clocks per instruction, cache hits/misses, TLB hits/misses, etc., which let you identify why certain sections of code may be taking longer to run than you would expect just from inspecting the code.
If you're doing an 'on the metal' embedded 'C' system (I'm not quite sure what 'mobile' implied in your posting), then you usually have some kind of timer ISR, in which it's fairly easy to sample the code address at which the interrupt occurred (by digging back in the stack or looking at link registers or whatever). Then it's trivial to build a histogram of addresses at some combination of granularity/range-of-interest; see the sketch below.
It's usually then not too hard to concoct some combination of code/script/Excel sheets which merges your histogram counts with addresses from your linker symbol/list file to give you profile information.
If you're very RAM-limited, it can be a bit of a pain to collect enough data for this to be both simple and useful, but you would need to tell us more about your platform.
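A minimal sketch of the histogram part, with all names and the code range as assumptions to adapt:
#include <stdint.h>
#define CODE_BASE    0x00010000u   /* assumed start of .text; take it from the map file */
#define BUCKET_SHIFT 6             /* one bucket per 64 bytes of code */
#define NUM_BUCKETS  4096u
static uint32_t hist[NUM_BUCKETS];
/* Call from the timer ISR with the PC dug out of the exception stack frame. */
void profile_sample(uint32_t interrupted_pc)
{
    uint32_t bucket = (interrupted_pc - CODE_BASE) >> BUCKET_SHIFT;
    if (bucket < NUM_BUCKETS)   /* ignore samples outside the range of interest */
        hist[bucket]++;
}
Dumping hist over a serial port and matching hot buckets against the linker map file gives exactly the merge step described above.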
nProf - Free, does that for .NET.
Gets the job done, at least enough to see the 80/20. (20% of the code, taking 80% of the time)
Windows (.NET and Native Exes): AQTime is a great tool for the money. Standalone or as a Visual Studio plugin.
Java: I'm a fan of JProfiler. Again, can run standalone or as an Eclipse (or various other IDEs) plugin.
I believe both have trial versions.
The Google Perftools are extremely useful in this regard.
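For example, the CPU profiler attaches to an unmodified binary via LD_PRELOAD (a sketch; the library path and binary names here are placeholders):
LD_PRELOAD=/usr/lib/libprofiler.so CPUPROFILE=/tmp/myapp.prof ./myapp
pprof --text ./myapp /tmp/myapp.prof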
I use DevPartner with MSVC 6 on XP.
How are any tools going to work if your platform is a proprietary OS? I think you're doing the best you can right now.
