How to know the clock cycles, overall performance, etc. of a program? - C

I have 3 different algorithms which all calculate the same stuff.
My goal is to compare all three algorithms in terms of clock cycles, "how intensive it is for the processor", the time needed to get the final result, the overall performance, and so on.
How can I see/get/analyze all of this information?
I am programming in MATLAB and in C in Code Composer Studio for an embedded system.
EDIT: information on memory usage/management would be useful as well, especially for the embedded system.

First you can compare the size of your output files; most of the time the bigger one is slower.
Getting the exact clock cycles is not easy. You must know how many clock cycles each assembly instruction needs and add them up for your code.
If you are running it directly on your hardware, you can toggle a port pin at the start and end points and do a timing measurement. (Keep in mind that interrupts may fire and slow you down.)
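A minimal sketch of that port-toggle approach in C, assuming a memory-mapped GPIO output register; the register address and pin mask here are hypothetical and must be replaced with the ones from your device's datasheet:

/* Hypothetical memory-mapped GPIO output register and test pin. */
#define GPIO_OUT (*(volatile unsigned int *)0x40020014u)
#define PIN_MASK (1u << 5)

void run_algorithm(void);    /* the code under test */

void timed_run(void)
{
    GPIO_OUT |=  PIN_MASK;   /* pin high: start of measurement */
    run_algorithm();
    GPIO_OUT &= ~PIN_MASK;   /* pin low: end of measurement */
}

Measure the width of the resulting pulse with an oscilloscope or logic analyzer; disabling interrupts around the measured section gives more repeatable numbers.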

For the MATLAB part, you should use the timeit function to evaluate performance. You can also use profile to inspect which, if any, parts of the code are causing bottlenecks.

Related

Benchmarking microcontrollers

Currently I am working on setting up a benchmark between microcontrollers (based on PowerPC). So I would like to know if anyone can provide me with documentation showing in detail what factors are most important to consider for benchmarking.
In other words, I am looking for documentation that provides detailed information about the factors to consider for enhancing the performance of the:
Core,
Peripherals,
Memory banks.
Plus, if someone could provide algorithms, that would be a lot of help.
There is only one useful way, and that is to write your application for both and time your application. Benchmarks are for the most part bogus; there are too many factors, and it is quite trivial to craft a benchmark that takes advantage of the differences, or even takes advantage of the common features, in a way that makes two things look different.
I perform this stunt on a regular basis, most recently with this code:
.globl ASMDELAY
ASMDELAY:
    subs r0,r0,#1    @ decrement the loop count passed in r0
    bne ASMDELAY     @ branch back until it reaches zero
    bx lr            @ return to caller
Run on a Raspberry Pi (bare metal), the same Raspberry Pi, not comparing two boards, just comparing it to itself, and clearly assembly, so not even taking into account compiler features/tricks that you can encode in the benchmark intentionally or accidentally. Two of those three instructions matter for benchmarking purposes; have the loop run many tens of thousands of times, I think I used 0x100000. The punchline to that performance was that those two instructions in a loop ran as fast as 93662 timer ticks and as slow as 4063837 timer ticks for 0x10000 loops. Certainly the instruction cache and branch prediction were turned on and off for various tests. But even with both branch prediction and the instruction cache on, these two instructions will vary in speed depending on where they lie within the fetch line and the cache line.
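A rough sketch of how such a measurement might be taken on bare metal, assuming a free-running hardware timer; the register address here is hypothetical and must be replaced with the actual counter on your board:

/* Hypothetical free-running timer register (substitute the real address). */
#define TIMER_CLO (*(volatile unsigned int *)0x20003004u)

extern void ASMDELAY(unsigned int count);   /* the assembly loop above */

unsigned int time_asmdelay(unsigned int loops)
{
    unsigned int start = TIMER_CLO;
    ASMDELAY(loops);
    return TIMER_CLO - start;   /* elapsed timer ticks */
}

Calling this repeatedly, with the instruction cache and branch prediction toggled between runs, is how a spread like the one quoted above shows up.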
A microcontroller makes this considerably worse, depending on what you are comparing. Some have flashes that can use the same wait state for a wide range of clock speeds; some are speed limited, and for every N MHz you have to add another wait state, so depending on where you set your clock it affects performance across that range, and definitely just below and just above the boundary where you add a wait state (24 MHz minus a smidge versus 24 MHz with an extra wait state: if it went from 2 to 3 wait states then fetching just got 50% slower; 36 MHz minus a smidge may still be at the 3 wait states, but 3 wait states at 36 MHz minus a smidge is faster than 24 MHz with 3 wait states). If you run the same code in SRAM vs flash for those platforms, there usually isn't a wait-state issue; the SRAM can usually match the CPU clock, so that code at any speed may be faster than the same code run from flash.
If you are comparing two microcontrollers from the same vendor and family then it is usually pointless; the internals are the same, they usually just vary in how many of each thing they have: how many flash banks, how many SRAM banks, how many UARTs, how many timers, how many pins, etc.
One of my points is that if you don't know the nuances of the overall architecture, you can possibly make the same code you are running now on the same board a few percent to tens of times faster simply by understanding how things work: enabling features you didn't know were there, proper alignment of the code that is exercised often (simply re-arranging your functions within a C file can/will affect performance), adding one or more nops in the bootstrap to change the alignment of the whole program can and will change performance.
Then you get into compiler differences and compiler options; you can play with those and also get anywhere from some to several to dozens of times improvement (or loss).
So at the end of the day the only thing that matters is: I have an application, it is the final binary, and how fast does it run on A; then I ported that application, the final binary for B is done, and how fast does it run there. Everything else can be manipulated; the results can't be trusted.

how to count cycles?

I'm trying to find the relative merits of 2 small functions in C: one that adds by loop, one that adds by explicit variables. The functions are irrelevant themselves, but I'd like someone to teach me how to count cycles so as to compare the algorithms. So f1 will take 10 cycles, while f2 will take 8. That's the kind of reasoning I would like to do. No performance measurements (e.g. gprof experiments) at this point, just good old instruction counting.
Is there a good way to do this? Are there tools? Documentation? I'm writing C, compiling with gcc on an x86 architecture.
http://icl.cs.utk.edu/papi/
PAPI_get_real_cyc(3) - return the total number of cycles since some arbitrary starting point
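A small sketch of how PAPI might be used to bracket one run of a function, assuming the library is installed and the program is linked with -lpapi (f1 here stands in for the function from the question):

#include <stdio.h>
#include <papi.h>

void f1(void);   /* function under test */

int main(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;

    long long start = PAPI_get_real_cyc();
    f1();
    long long cycles = PAPI_get_real_cyc() - start;

    printf("f1 took %lld cycles\n", cycles);
    return 0;
}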
The assembler instruction rdtsc (Read Time-Stamp Counter) returns in the EDX:EAX registers the current CPU tick count, counted from CPU reset. If your CPU is running at 3 GHz then one tick is 1/(3 GHz), about 0.33 ns.
EDIT:
Under MS Windows the API call QueryPerformanceFrequency returns the number of ticks per second.
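A sketch of reading the counter from C with gcc inline assembly, matching the EDX:EAX description above (x86 only; note that out-of-order execution can blur very short measurements):

#include <stdio.h>

static inline unsigned long long rdtsc(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

void f2(void);   /* function under test */

int main(void)
{
    unsigned long long start = rdtsc();
    f2();
    printf("f2 took %llu ticks\n", rdtsc() - start);
    return 0;
}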
Unfortunately timing the code is as error prone as visually counting instructions and clock cycles. Be it a debugger or other tool, or re-compiling the code with a run-it-10000000-times-and-time-it loop, you change where things land in the cache line, the frequency of the cache hits and misses, etc. You can mitigate some of this by adding or removing some code upstream from the module of code being tested (to cause a few instructions to be added or removed, changing the alignment of your program and sometimes of your data).
With experience you can develop an eye for performance by looking at the disassembly (as well as the high-level code). There is no substitute for timing the code; the problem is that timing the code is error prone. The experience comes from many experiments and trying to fully understand why adding or removing one instruction made no difference or a dramatic one, and why code added or removed in a completely different, unrelated area of the module under test made huge performance differences on the module under test.
As GJ has written in another answer I also recommend using the "rdtsc" instruction (rather than calling some operating system function which looks right).
I've written quite a few answers on this topic. Rdtsc allows you to calculate the elapsed clock cycles in the code's "natural" execution environment rather than having to resort to calling it ten million times which may not be feasible as not all functions are black boxes.
If you want to calculate elapsed time you might want to shut off energy-saving on the CPUs. If it's only a matter of clock cycles this is not necessary.
If you are trying to compare performance, the easiest way is to put your algorithm in a loop and run it 1000 or 1000000 times.
Once you are running it enough times that the small differences can be seen, run time ./my_program, which will give you the amount of processor time it used.
Do this a few times to get a sampling and compare the results.
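A minimal harness for that approach; my_algorithm and my_program are stand-in names:

int my_algorithm(int x);     /* code under test */

int main(void)
{
    volatile int sink = 0;   /* keeps the compiler from optimizing the loop away */
    for (int i = 0; i < 1000000; i++)
        sink += my_algorithm(i);
    return 0;
}

Build it, run time ./my_program a few times, and compare the reported user CPU time between the two algorithms.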
Trying to count instructions won't help you on x86 architecture. This is because different instructions can take significantly different amounts of time to execute.
I would recommend using simulators. Take a look at PTLsim; it will give you the number of cycles. Apart from that, you may want to look at tools that count the number of times each assembly line is executed.
Use gcc -S your_program.c. The -S flag tells gcc to generate the assembly listing, which will be named your_program.s.
There are plenty of high-performance clocks around. QueryPerformanceCounter is Microsoft's. The general trick is to run the function tens of thousands of times and time how long it takes, then divide the time taken by the number of loops. You'll find that each loop takes a slightly different length of time, so this testing over multiple passes is the only way to truly find out how long it takes.
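A Windows-only sketch of that approach, assuming the function under test is called my_function (a stand-in name):

#include <stdio.h>
#include <windows.h>

void my_function(void);   /* function under test */

int main(void)
{
    LARGE_INTEGER freq, start, end;
    const int loops = 100000;
    int i;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&start);
    for (i = 0; i < loops; i++)
        my_function();
    QueryPerformanceCounter(&end);

    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("average: %g seconds per call\n", seconds / loops);
    return 0;
}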
This is not really a trivial question. Let me try to explain:
There are several tools on different OSes to do exactly what you want, but those tools are usually part of a bigger environment. Every instruction is translated into a certain number of cycles, depending on the CPU the code was compiled for and the CPU the program is executed on.
I can't give you a definitive answer, because I do not have enough data to pass judgement on, but I work for IBM in the database area and we use tools to measure cycles and instructions for our code, and those traces are only valid for the actual CPU the program was compiled for and run on.
Depending on the internal structure of your CPU's pipelining and on the efficiency of your compiler, the resulting code will most likely still have cache misses and other areas you have to worry about. (In that case you may want to look into FDPR...)
If you want to know how many cycles your program needs to run on your CPU (with the code your compiler generated), you have to understand how the CPU works and how the compiler generated the code.
I'm sorry if the answer is not sufficient to solve your problem at hand. You said you are using gcc on an x86 arch. I would work on getting the assembly code mapped to your CPU.
I'm sure you will find some areas, where gcc could have done a better job...

function to purposefully have a high cpu load?

I'm making a program which controls other processes (as far as stopping and starting them goes).
One of the criteria is that the load on the computer is under a certain value.
So I need a function or small program which will cause the load to be very high for testing purposes. Can you think of anything?
Thanks.
I can think of this one :
for(;;);
If you want to actually generate peak load on a CPU, you typically want a modest-size (so that the working set fits entirely in cache) trivially parallelizable task, that someone has hand-optimized to use the vector unit on a processor. Common choices are things like FFTs, matrix multiplications, and basic operations on mathematical vectors.
These almost always generate much higher power and compute load than do more number-theoretic tasks like primality testing because they are essentially branch-free (other than loops), and are extremely homogeneous, so they can be designed to use the full compute bandwidth of a machine essentially all the time.
The exact function that should be used to generate a true maximum load varies quite a bit with the exact details of the processor micro-architecture (different machines have different load/store bandwidth in proportion to the number and width of multiply and add functional units), but numerical software libraries (and signal processing libraries) are great things to start with. See if any that have been hand tuned for your platform are available.
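A hedged sketch of such a load loop, assuming the compiler's auto-vectorizer is enabled (e.g. -O3); a hand-tuned FFT or matrix-multiply library will push the machine harder than this:

#define N 4096   /* small enough that the working set stays in cache */

float burn_cpu(long iterations)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = (float)i * 0.5f;
        b[i] = (float)i * 0.25f;
        c[i] = 0.0f;
    }
    for (long it = 0; it < iterations; it++)
        for (int i = 0; i < N; i++)
            c[i] += a[i] * b[i];   /* homogeneous, branch-free multiply-add work */
    return c[N - 1];               /* use the result so the loop is not removed */
}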
If you need to control how long will the CPU burst be, you can use something like the Sieve of Eratosthenes (algorithm to find primes until a certain number) and supply a smallish integer (10000) for short bursts, and a big integer (100000000) for long bursts.
If you want to take I/O into account for the load, you can write to a file for each test in the sieve.
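A minimal sketch of that sieve burst, with the limit parameter controlling how long the burst lasts (roughly 10000 for short, 100000000 for long, as suggested above):

#include <stdlib.h>

int sieve_burst(long limit)
{
    char *is_composite = calloc((size_t)limit + 1, 1);
    int count = 0;
    if (is_composite == NULL)
        return -1;
    for (long i = 2; i <= limit; i++) {
        if (!is_composite[i]) {
            count++;                                  /* i is prime */
            for (long j = i * 2; j <= limit; j += i)
                is_composite[j] = 1;                  /* mark multiples */
        }
    }
    free(is_composite);
    return count;   /* number of primes found up to limit */
}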

Estimate Power Consumption Based on Running Time Analysis / Code Size

I've developed and tested a C program on my PC and now I want to give an estimate of the power consumption required for the program to do a single run. I've analysed the running time of the application and of individual function calls within the application, and I know the code size both in assembly lines and in raw C lines.
How would I give an estimate of the power consumption based on the performance analysis and/or code size? I suppose it scales with the number of lines that use the CPU for computation or do memory accesses, but I was hoping for a more precise answer.
Also, how would I tell the difference between the power consumption on, say, my PC compared to on a Microchip device?
Good luck. What you want to do is pretty much impossible on a desktop PC. Best you could probably do would be to measure the from-the-wall power draw at idle, and when running your program, with as few other programs as possible running at the same time. Average the results over 100 or so runs, and you should have a value with accuracy of a few percent (standard statistical disclaimers apply).
On a Microchip device, it should be easier to calculate the power consumption, since they publish (average) power consumption values for the various modes, and the timing is deterministic. Unfortunately, there are so many differences between a processor like that and your desktop processor (word size, pipelining, multiple-issue, multiple processes, etc, etc) that there really won't be any effective way to compare the two.
There is a paper on Intel's website that gives average energy per instruction for various processors. They give 11 nJ per instruction for Core Duo, for example. How useful that'll be for you depends on how much your code looks like the SpecInt benchmark, I guess.
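As a back-of-the-envelope illustration of how such a figure might be used (the instruction count here is invented): a run that retires 5 billion instructions at 11 nJ per instruction would cost roughly 5e9 * 11e-9 J = 55 J of CPU energy, ignoring memory, I/O, and the rest of the platform.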

How to calculate MIPS for an algorithm for ARM processor

I have recently been asked to produce the MIPS (millions of instructions per second) figure for an algorithm we have developed. The algorithm is exposed by a set of C-style functions. We have exercised the code on a Dell Axim to benchmark the performance under different inputs.
This question came from our hardware vendor, but I am mostly a HL software developer so I am not sure how to respond to the request. Maybe someone with similar HW/SW background can help...
1. Since our algorithm is not real time, I don't think we need to quantify it as MIPS. Is it possible to simply quote the total number of assembly instructions?
2. If 1 is true, how do you do this (i.e. how to measure the number of assembly instructions) either in general or specifically for ARM/XScale?
3. Can 2 be performed on a WM device or via the Device Emulator provided in VS2005?
4. Can 3 be automated?
Thanks a lot for your help.
Charles
Thanks for all your help. I think S.Lott hit the nail on the head. And as a follow-up, I now have more questions.
5. Any suggestions on how to go about measuring MIPS? I heard someone suggest running our algorithm and comparing it against the Dhrystone/Whetstone benchmark to calculate MIPS.
6. Since the algorithm does not need to run in real time, is MIPS really a useful measure? (e.g. factorial(N)) What are other ways to quantify the processing requirements? (I have already measured the runtime performance but that was not a satisfactory answer.)
7. Finally, I assume MIPS is a crude estimate and would be dependent on the compiler, optimization settings, etc.?
I'll bet that your hardware vendor is asking how many MIPS you need.
As in "Do you need a 1,000 MIPS processor or a 2,000 MIPS processor?"
Which gets translated by management into "How many MIPS?"
Hardware offers MIPS. Software consumes MIPS.
You have two degrees of freedom.
The processor's inherent MIPS offering.
The number of seconds during which you consume that many MIPS.
If the processor doesn't have enough MIPS, your algorithm will be "slow".
If the processor has enough MIPS, your algorithm will be "fast".
I put "fast" and "slow" in quotes because you need to have a performance requirement to determine "fast enough to meet the performance requirement" or "too slow to meet the performance requirement."
On a 2,000 MIPS processor, you might take an acceptable 2 seconds. But on a 1,000 MIPS processor this explodes to an unacceptable 4 seconds.
How many MIPS do you need?
Get the official MIPS for your processor. See http://en.wikipedia.org/wiki/Instructions_per_second
Run your algorithm on some data.
Measure the exact run time. Average a bunch of samples to reduce uncertainty.
Report. 3 seconds on a 750 MIPS processor is -- well -- 3 seconds at 750 MIPS. MIPS is a rate. Time is time. Distance is the product of rate * time. 3 seconds at 750 MIPS is 750*3 million instructions.
Remember Rate (in Instructions per second) * Time (in seconds) gives you Instructions.
Don't say that it's 3*750 MIPS. It isn't; it's 2250 Million Instructions.
Some notes:
MIPS is often used as a general "capacity" measure for processors, especially in the soft real-time/embedded field where you do want to ensure that you do not overload a processor with work. Note that this IS instructions per second, as the time is very important!
MIPS used in this fashion is quite unscientific.
MIPS used in this fashion is still often the best approximation there is for sizing a system and determining the speed of the processor. It might well be off by 25%, but never mind...
Counting MIPS requires a processor that is close to what you are using. The right instruction set is obviously crucial, to capture the actual instruction stream from the actual compiler in use.
You cannot in any way approximate this on a PC. You need to bring out one of a few tools to do this right:
Use an instruction-set simulator for the target architecture such as Qemu, ARM's own tools, Synopsys, CoWare, Virtutech, or VaST. These are fast, can count instructions pretty well, and will support the right instruction set. Barring extensive use of expensive instructions like integer divide (and please no floating point), these numbers tend to be usefully close.
Find a clock-cycle-accurate simulator for your target processor (or something close), which will give a pretty good estimate of pipeline effects etc. Once again, get it from ARM or from Carbon SoCDesigner.
Get a development board for the processor family you are targeting, or an ARM design close to it, and profile the application there. You wouldn't use an ARM9 to profile for an ARM11, but an ARM11 might be a good approximation for an ARM Cortex-A8/A9, for example.
MIPS is generally used to measure the capability of a processor.
Algorithms usually take either:
a certain amount of time (when running on a certain processor)
a certain number of instructions (depending on the architecture)
Describing an algorithm in terms of instructions per second would seem like a strange measure, but of course I don't know what your algorithm does.
To come up with a meaningful measure, I would suggest that you set up a test which allows you to measure the average time taken for your algorithm to complete. The number of assembly instructions would be a reasonable measure, but it can be difficult to count them! Your best bet is something like this (sketched as C using the standard clock() timer; runAlgorithm and randomData are placeholders, and the fragment goes inside your timing function):
#include <time.h>

const int num_trials = 1000000;
clock_t start_time = clock();
for (int i = 0; i < num_trials; i++)
{
    runAlgorithm(randomData);
}
double time_taken = (double)(clock() - start_time) / CLOCKS_PER_SEC;
double average_time = time_taken / num_trials;
MIPS is a measure of CPU speed, not algorithm performance. I can only assume that somewhere along the line, someone is slightly confused. What are they trying to find out? The only likely scenario I can think of is that they're trying to help you determine how fast a processor they need to give you to run your program satisfactorily.
Since you can measure an algorithm in number of instructions (which is no doubt going to depend on the input data, so this is non-trivial), you then need some measure of time in order to get MIPS -- for instance, say "I need to invoke it 1000 times per second". If your algorithm is 1000 instructions for that particular case, you'll end up with:
1000 instructions / (1/1000) seconds = 1000000 instructions per second = 1 MIPS.
I still think that's a really odd way to try to do things, so you may want to ask for clarification. As for your specific questions, I'll leave that to someone more familiar with Visual Studio.
Also remember that different compilers and compiler options make a HUGE difference. The same source code can run at many different speeds. So instead of buying the 2 MIPS processor you may be able to use the 1/2 MIPS processor with the right compiler option. Or spend the money on a better compiler and use the cheaper processor.
Benchmarking is flawed at best. As a hobby I used to compile the same Dhrystone (and Whetstone) code with various compilers from various vendors for the same hardware, and the numbers were all over the place, orders of magnitude apart. Same source code, same processor; Dhrystone didn't mean a thing, not useful as a baseline. What matters in benchmarking is how fast YOUR algorithm runs; it had better be as fast as or faster than it needs to be. Depending on how close to the finish line you are, allow for plenty of slop. Early on you probably want to be running 5 or 10 or 100 times faster than you need to, so that by the end of the project you are at least slightly faster than you need to be.
I agree with what I think S. Lott is saying; this is all sales and marketing and management talk. Being the one that management has put between a rock and a hard place, what you need to do is get them to buy the fastest processor and best tools that they are willing to spend on, based on the colorful pie charts and graphs that you are going to generate from thin air as justification. If near the end of the road it doesn't quite meet performance, then you could return to stackoverflow, but at the same time management will be forced to buy a different toolchain at almost any price or swap processors and respin the board. By then you should know how close to the target you are: we need 1.0 and we are at 1.25, so if we buy the processor that is twice as fast as the one we bought, we should make it.
Whether or not you can automate these kinds of things or simulate them depends on the tools; sometimes yes, sometimes no. I am not familiar with the tools you are talking about, so I can't speak to them directly.
This response is not intended to answer the question directly, but to provide additional context around why this question gets asked.
MIPS for an algorithm is only relevant for algorithms that need to respond to an event within the required time.
For example, consider a controller designed to detect the wind speed and move the actuator within a second when the wind speed crosses over 25 miles / hour. Let us say it takes 1000 instructions to calculate and compare the wind speed against the threshold. The MIPS requirement for this algorithm is 1 Kilo Instructions Per Second (KIPs). If the controller is based on 1 MIPS processor, we can comfortably say that there is more juice in the controller to add other functions.
What other functions could be added on the controller? That depends on the MIPS of the function/algorithm to be added. If there is another function that needs 100,000 instructions to be performed within a second (i.e. 100 KIPs), we can still accommodate this new function and still have some room for other functions to add.
For a first estimate a benchmark on the PC may be useful.
However, before you commit to a specific device and clock frequency you should get a developer board (or some PDA?) for the ARM target architecture and benchmark it there.
There are a lot of factors influencing the speed on today's machines (caching, pipelines, different instruction sets, ...) so your benchmarks on a PC may be way off w.r.t. the ARM.
