How to measure ARM performance? - c

I'm working on optimizing a piece of software and want to measure its performance. I am currently simulating an ARM platform with OVP (Open Virtual Platforms), and the statistics I get are simulation time and simulated instructions.
My question is: why is the simulated instruction count different every time I run the software (different, but in close proximity)? Shouldn't it be the same every time? Isn't it the case that the software I write in C is compiled into ARM assembly instructions, and each time the software runs, the simulated instruction count is simply how many times those ARM assembly instructions were executed? Shouldn't that be identical on every run?
How should I measure performance? Take 10 samples of the simulated instruction count and average them?

From my experience with a real (non-simulated) ARM, if I take cycle counts for a section of code, the number of cycles will vary. This is because:
There can be context switches in the middle of your executing code.
The initial state of the CPU may be different upon entering the code section. (e.g. the content of the pipeline, branch prediction etc.)
The cache state will be different on entry to the code section.
External factors such as other hardware accessing external memory.
Due to all of this, taking an average (plus some other statistical measures) is really the only practical approach for real hardware and a real OS. In a good simulator some of these factors are potentially eliminated.
On some real chips (or if supported by the simulator) the ARM Performance Monitoring Unit can be useful.
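For example, on Linux the PMU cycle counter can be read from user space through the kernel's perf interface. A minimal sketch, assuming a Linux kernel with perf support and permissive perf_event_paranoid settings; work_under_test() is a hypothetical placeholder for the code section being measured:

    /* Count CPU cycles for a code section via perf_event_open(2), which is
     * backed by the ARM PMU cycle counter, and average over several runs. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                int cpu, int group_fd, unsigned long flags)
    {
        return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
    }

    static void work_under_test(void)      /* placeholder for your code section */
    {
        volatile uint64_t x = 0;
        for (int i = 0; i < 100000; i++) x += i;
    }

    int main(void)
    {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        int fd = perf_event_open(&attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        const int runs = 10;
        uint64_t total = 0;
        for (int i = 0; i < runs; i++) {
            uint64_t cycles = 0;
            ioctl(fd, PERF_EVENT_IOC_RESET, 0);
            ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
            work_under_test();
            ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
            if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles)) cycles = 0;
            total += cycles;
            printf("run %d: %llu cycles\n", i, (unsigned long long)cycles);
        }
        printf("average over %d runs: %llu cycles\n",
               runs, (unsigned long long)(total / runs));
        close(fd);
        return 0;
    }

Even with the PMU, expect run-to-run variation from caches and branch predictors, which is why the sketch averages several runs.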
If you're coding for the Cortex A8 this is a cool online cycle counter that can really help you squeeze more performance out of your code.

Related

How to microbenchmark an algorithm on an ARM laptop?

On servers and desktops it's reasonable to just disable frequency scaling and then run microbenchmarks.
But how do you meaningfully run microbenchmarks on ARM, given the mix of performance and efficiency cores?
Specific context:
Processor: Apple M2
Microbenchmarks for basic algorithms (think sort, strlen, along those lines).
I am interested in measuring on both the energy-efficient and the high-performance cores.
UPD: to avoid confusion: I'm pretty sure the library (Google Benchmark) can measure the time correctly. I just want to run the binary correctly.
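One practical angle on the M2 case: macOS does not expose hard core affinity, but thread QoS classes bias the scheduler toward the performance or the efficiency cluster. A minimal sketch, assuming QoS hints are acceptable for your purposes; run_benchmark() is a hypothetical placeholder for the actual benchmark body:

    /* Run the same workload twice, once hinted toward P cores and once
     * toward E cores, via macOS thread QoS classes (hints, not hard pinning). */
    #include <pthread.h>
    #include <pthread/qos.h>
    #include <stdint.h>
    #include <stdio.h>

    static void run_benchmark(void)        /* placeholder workload */
    {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 100000000UL; i++) x += i;
    }

    static void *bench_thread(void *arg)
    {
        qos_class_t qos = (qos_class_t)(uintptr_t)arg;
        pthread_set_qos_class_self_np(qos, 0);   /* bias scheduling for this thread */
        run_benchmark();
        return NULL;
    }

    int main(void)
    {
        qos_class_t classes[2] = { QOS_CLASS_USER_INTERACTIVE, QOS_CLASS_BACKGROUND };
        const char *names[2]   = { "P-core biased", "E-core biased" };

        for (int i = 0; i < 2; i++) {
            pthread_t t;
            printf("running %s pass\n", names[i]);
            pthread_create(&t, NULL, bench_thread, (void *)(uintptr_t)classes[i]);
            pthread_join(t, NULL);
        }
        return 0;
    }

Because these are only scheduling hints, it is worth checking (e.g. with powermetrics) which cluster actually ran each pass.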

Full software timer: derive the time?

I have been asked a question but I am not sure if I answered it correctly.
"Is it possible to rely only on software timer?"
My answer was "yes, in theory".
But then I added:
"Just relying on hardware timer at the kernel loading (rtc) and then
software only is a mess to manage since we must be able to know
how many cpu cycles each instruction took + eventual cache miss +
branching cost + memory speed and put a counter after each one or
group (good luck with out-of-order cpu).
And do the calculation to derivate the current cpu cycle. That is
insane.
Not talking about the overall performance drop.
The best we could have is a brittle approximation of the time which
become more wrong over time. Even possibly on short laps."
It seems logical to me, but did my thinking go wrong somewhere?
Thanks
On current processors and hardware (e.g. Intel, AMD or ARM, in laptops, desktops or tablets) with common operating systems (Linux, Windows, FreeBSD, MacOSX, Android, iOS, ...), processes are scheduled at essentially random times. So cache behavior is non-deterministic and instruction timing is not reproducible. You need some hardware time measurement.
A typical desktop or laptop gets hundreds, or thousands, of interrupts every second, most of them time related. Try running cat /proc/interrupts on a Linux machine twice, with a few seconds between the runs.
I guess that even with a single-tasking, MS-DOS-like operating system, you'd still get nondeterministic behavior (e.g. induced by ACPI or SMM). On some laptops the processor frequency is throttled by its temperature, which depends on the CPU load and the ambient temperature...
In practice you really want to use some timer provided by the operating system. For Linux, read time(7).
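A minimal sketch of what that looks like on Linux, using CLOCK_MONOTONIC as described in time(7); do_work() is a hypothetical placeholder:

    /* Measure elapsed wall-clock time around a code section with an
     * OS-provided timer (CLOCK_MONOTONIC is not affected by clock changes). */
    #include <stdio.h>
    #include <time.h>

    static void do_work(void)              /* placeholder for the measured code */
    {
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 10000000UL; i++) x += i;
    }

    int main(void)
    {
        struct timespec start, end;

        clock_gettime(CLOCK_MONOTONIC, &start);
        do_work();
        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (end.tv_sec - start.tv_sec)
                       + (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("elapsed: %.6f s\n", elapsed);
        return 0;
    }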
So in practice you cannot rely on a purely software timer. However, the processor has internal timers, and even in principle you cannot avoid them on current processors.
You might be able, if you can put your hardware in a very controlled (e.g. thermostatically regulated) environment, to run a very limited piece of software (an OS-like, freestanding thing) sitting entirely in the processor cache, and perhaps then get some determinism; but in practice current laptop, desktop or tablet hardware is non-deterministic and you cannot predict the time needed for a given small machine routine.
Timers are extremely useful in interesting (non-trivial) software; see e.g. J. Pitrat's CAIA, a sleeping beauty blog entry for an interesting point. Also look at the many uses of watchdog timers in software (e.g. in the Parma Polyhedra Library).
Read also about Worst Case Execution Time (WCET).
So I would say that even in theory it is not possible to rely upon a purely software timer (unless, of course, that software uses the processor's timers, which are hardware circuits). In the previous century (up to the 1980s or 1990s) hardware was much more deterministic, and the number of clock cycles or microseconds needed for each machine instruction was documented (though some instructions, e.g. division, took a variable amount of time depending on the actual data!).

Can you check performance of a program running with Qemu Simulator?

Say I am running an ARM simulator using QEMU: is it possible to find the execution time of a program as it would be on a real ARM processor? In other words, if I use functions such as gettimeofday in a program running on the simulator to check the elapsed time, will the elapsed time be reported accurately through cycle-accurate simulation?
Investigation into this issue at our company concluded that QEMU (for ARM) is not cycle accurate. If I remember correctly, cycle accuracy is not a goal of QEMU; instead it aims at fast emulation. Beware also that exact timing depends on quite unpredictable things like cache hits and misses. It will also depend on the actual architecture chosen; note that ARM is merely an instruction set IP and several different implementations exist. If, in addition, an operating system is emulated, things get even more unpredictable.
We use the simulator from ARM to evaluate performance, but even that one is not fully cycle accurate for the latest versions of the ARM architecture.
GEM5
I have seen a researcher use gem5 for this. This paper evaluates how accurate it is. And I've created an easy to get started setup on GitHub.
As Bryan mentioned, QEMU is designed for speed: only valid guest instruction set behavior must be reproduced, not necessarily with the right number of cycles or in the same pipeline order. This is also called functional emulation.
Furthermore, DRAM memory accesses are assumed to be immediate, and therefore it makes no sense to emulate caches either. And as we know, current CPUs are basically memory latency hiding machines.
Cycle accurate emulators on the other hand, also emulate CPU internals, and are therefore way slower.
The root of the problem is of course the under-documented performance features of processors, which vendors don't release to prevent intellectual property leakage.
GEM5 appears to implement a generic version of common CPU internals, so it should be more cycle accurate than functional emulators, but true cycle accurate emulation is likely impossible without insider knowledge.
Third party emulation implementors must then reverse engineer CPU performance from experiments and existing documentation.
Some of the key "internals" are cache, pipeline and branch prediction.
Related:
Question that asks how cycle accurate emulators are possible at all: How can CAS simulators like PTLsim achieve cycle accurate simulation of x86 hardware?
ARM Cycle-Accurate Simulator

Same codebase for CPU and GPU

Does anybody have any experience maintaining a single codebase for both CPU and GPU?
I want to create an application which, when possible, would use the GPU for some long-running calculations, but if a compatible GPU is not present on the target machine it would just use a regular CPU version. It would be really helpful if I could just write a portion of the code using conditional compilation directives so that it compiles to both a CPU version and a GPU version. Of course there will be some parts which are different for CPU and GPU, but I would like to keep the essence of the algorithm in one place. Is this at all possible?
OpenCL is a C-based language. OpenCL platforms exist that run on GPUs (from NVidia and AMD) and CPUs (from Intel and AMD).
While it is possible to execute the same OpenCL code on both GPUs and CPUs, it really needs to be optimized for the target device: different code would need to be written for different GPUs and CPUs to get the best performance. However, a CPU OpenCL platform can serve as a low-performance fallback even for GPU-optimized code.
If you are happy writing conditional directives that execute depending on the target device (CPU or GPU) then that can help performance of OpenCL code on multiple devices.
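A minimal sketch of that GPU-first, CPU-fallback device selection, using the standard OpenCL host API (error handling trimmed to the essentials; link with -lOpenCL):

    /* Pick a GPU device if one is available, otherwise fall back to a CPU
     * device, so the same kernel source can be built for either. */
    #include <stdio.h>
    #ifdef __APPLE__
    #include <OpenCL/cl.h>
    #else
    #include <CL/cl.h>
    #endif

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_int err;

        if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
            fprintf(stderr, "no OpenCL platform found\n");
            return 1;
        }

        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        if (err != CL_SUCCESS)   /* no GPU: fall back to the CPU device */
            err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);
        if (err != CL_SUCCESS) {
            fprintf(stderr, "no usable OpenCL device\n");
            return 1;
        }

        char name[256];
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("selected device: %s\n", name);

        /* ...create the context and queue, then build the same kernel
         * source for whichever device was selected... */
        return 0;
    }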

Any open-source ARM7 emulators suitable for linking with C?

I have an open-source Atari 2600 emulator (Z26), and I'd like to add support for cartridges containing an embedded ARM processor (NXP 21xx family). The idea would be to simulate the 6507 until it tries to read or write a byte of memory (which it will do every 841ns). If the 6507 performs a write, put the address and data on some of the ARM's I/O ports and let the ARM code run 20 cycles, confirm that the ARM is floating its data bus, and let the ARM run for another 38 cycles. If the 6507 performs a read, put the address on the ARM's I/O ports, let the ARM run 38 cycles, grab the data from the ARM's I/O port (hopefully the ARM software will have put it there), and let the ARM run another 20 cycles.
The ARM7 seems pretty straightforward to implement; I don't need to simulate a whole lot of hardware features. Any thoughts?
Edit
What I have in mind would be a routine that takes as parameters a struct holding the machine state and pointers to memory-access routines. When called, the routine would emulate the ARM's instruction engine, generating appropriate reads, writes, and code fetches. I could then write the memory-access routines to treat appropriate areas as flash (with roughly approximated wait states), RAM, I/O ports, and timer registers. Some other areas would be marked as don't-care, and accesses to any other areas would flag an error and stop the emulator.
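A rough sketch of that kind of interface, in C; all of these names are hypothetical and not taken from any existing emulator:

    /* Host-provided machine state plus read/write callbacks; the core runs a
     * bounded number of cycles and reports how many it actually consumed. */
    #include <stdint.h>

    typedef struct arm_state arm_state_t;   /* registers, CPSR, pipeline state, ... */

    typedef uint32_t (*arm_read_fn)(void *ctx, uint32_t addr, int size);
    typedef void     (*arm_write_fn)(void *ctx, uint32_t addr, uint32_t data, int size);

    typedef struct {
        arm_state_t *cpu;     /* architectural state owned by the host */
        void        *ctx;     /* opaque pointer passed back to the callbacks */
        arm_read_fn  read;    /* host decodes flash/RAM/IO/timer/don't-care regions */
        arm_write_fn write;
    } arm_core_t;

    /* Run the core for at least 'cycles' cycles; the 6507 side would call this
     * with the 20- and 38-cycle budgets around each memory access. */
    int arm_core_run(arm_core_t *core, int cycles);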
Perhaps QEMU uses such a thing internally. Since the ARM emulation would be integrated into an already-existing emulation engine (which I didn't write and don't fully understand--the only parts of Z26 I've patched have been the memory read/write logic) I would need something with a fairly small footprint.
Any idea how QEMU works inside? Any idea what the GPL licence would require if I just use 2% of the code in QEMU--whether I'd have to bundle the code for the whole thing, or just the part that I use, or what?
Try QEMU.
With some work, you can make my emulator do what you want. It was written for the ARM920, and the Thumb instruction set isn't done yet; neither is the MMU/cache interface. Also, it's slow because it is an interpreter. On the bright side, it's all written in C99.
http://code.google.com/p/gp2xemu/
I haven't worked on it for a while (the svn trunk is 2 years old), but if you're going to use the code, I'll be glad to help you out with the missing features. It is licensed under MIT, so it's essentially the same as the broad BSD license.
