Energy consumption of an algorithm in C code

I need to calculate the energy consumption of an algorithm written in C. Any ideas on how this can be done, and whether there are pre-defined functions for it?
Thanks in advance,

It is nigh impossible to do this from the C code alone. A C compiler is allowed to translate a given chunk of code in many different ways, which rules out any precise calculation of energy consumption; you need to know how the compiler translates which code for which architecture.
It is much simpler (for certain degrees of "simple") to count the assembler instructions and multiply them by their respective latencies (listed in the technical specs, e.g. for a randomly chosen MCU from the Wiki list: http://www.atmel.com/Images/doc32002.pdf). This is still not exact; division, for example, might take a different number of CPU cycles depending on the input, CPU architecture and hardware implementation, but it comes quite close and is rather simple, although a bit tedious.
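As a very rough sketch of that bookkeeping, something like the following (every number here is a placeholder; take the real cycle counts from the instruction set manual and the current draw from the MCU's electrical characteristics):

#include <stdio.h>

int main(void)
{
    /* instruction counts from the assembly listing (placeholders) */
    long adds = 120, muls = 15, loads = 40;
    /* cycles per instruction from the datasheet (placeholders)    */
    long add_cycles = 1, mul_cycles = 2, load_cycles = 2;

    double f_cpu   = 16e6;    /* clock frequency in Hz (placeholder)    */
    double voltage = 3.3;     /* supply voltage in V (placeholder)      */
    double current = 0.010;   /* active current draw in A (placeholder) */

    double cycles  = adds * add_cycles + muls * mul_cycles + loads * load_cycles;
    double seconds = cycles / f_cpu;
    double joules  = seconds * voltage * current;    /* E = P * t, P = U * I */

    printf("~%.0f cycles, %.3g s, %.3g J\n", cycles, seconds, joules);
    return 0;
}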
And then there are the loops with a completely unknown number of iterations, inputs taking different paths with different runtimes, and much more. The people writing encryption software know a lot about this, especially about how to avoid it. You might not like their solutions.
Otherwise: check what you expect for input (you do know which path which input takes, don't you?), write a test program, go to e.g. https://www.rohde-schwarz.com/ to get a good meter, and measure the power consumption. You also need an engineer who knows how to do that; it's not easy!

Related

How can I prove or disprove the efficiency of compilation?

This is an unusual question, but I do hope there's a definitive answer.
There's a longstanding debate in our office about how efficiently compilers generate code, specifically the number of instructions they emit. We write code for low-power embedded systems with virtually no loops. Therefore, the number of instructions emitted is directly proportional to the power consumed.
Much of our code looks like this (notice: no dynamic memory allocation, no system calls, very few function calls, very few loops).
foo += 3 * (77 + bar);
if (baz > 18 - qux)
    bar -= 19 + 7 >> spam;
I can compile the above snippet with -O3 and read the assembly, but I couldn't write it myself.
The claim I would like to prove or disprove is that compilers generate code that is 2-4X "fatter" (and therefore consumes 2-4X as much power) compared with hand-written assembly code.
I'm interested in any compiler with which you have experience.
From this answer I know that GCC and clang can emit assembly interleaved with the C code with
gcc -g -c -Wa,-alh foo.cc
These answers provide a solid basis:
When is assembly faster?
Why do you program in assembly?
How can I measure the efficiency with which a compiler generates code?
Hand assembly can always at least match if not beat the compiler, because at the very least, you can start with the compiler generated assembly code and tweak it to make it better. To really do a good job, you need to understand the CPU architecture (pipeline, functional units, memory hierarchy, out-of-order dispatch units, etc.) so that you can schedule each instruction for maximum efficiency.
Another thing to consider is that the number of instructions is not necessarily directly proportional to performance, whether it is speed or power (see Hennessy and Patterson's Computer Architecture: A Quantitative Approach). Basically, you have to look at how many clock cycles each instruction takes, in addition to the number of instructions (and the clock rate), to know how long it will take. To know how much energy will be consumed, you also need to know how much energy each instruction takes.
How the CPU implements each instruction affects how many cycles it takes to execute. As an example, your code sequence has a >> operator. The compiler might translate that to a single ASR instruction, but without knowing the architecture, there is no telling how many clock cycles it might take -- some architectures can do an arbitrary shift in a single cycle, while others need one cycle for each bit shift.
Memory access contributes to the number of cycles and power consumption, too. When there are too many variables to store in registers, some of them will have to be stored in memory. If you are accessing off-chip memory and have a fairly high CPU clock rate, the memory bus can be pretty power hungry. A longer sequence of instructions that avoids reading from and writing to memory (e.g., by recomputing a value instead of storing and reloading it) can be less expensive.
As several others have suggested, there is no substitute for benchmarking. Assuming you are using a microcontroller-based system with a constant input voltage, your best bet is to measure the current draw of your system with each alternative set of code and see which does best (one way would be with a current probe and a digital storage oscilloscope).
Even if you can always write better assembler than the compiler, there is a cost in development time and maintainability. In The Mythical Man-Month, Brooks estimated 3-5x more effort at a time when many, if not most, programmers wrote code in assembler. Unless your code is really tiny, you are probably best off only coding the most critical parts in assembly. Even so, the person writing the assembly should be able to prove that their (more expensive) code is worth the cost by comparing running code vs. running code.
If the question is "how can I measure the efficiency with which a compiler generates code" (your actual question), the answer is "that depends". It depends on how you define "efficiency". Mostly, compilers are designed to optimize for speed. As you change the optimization level (-O1, -O2, -O3), the compiler will spend more time looking for "clever things to do to make it just a bit faster". This can involve loop unrolling, order of execution, use of registers, and many other things.
It seems that your "efficiency" criterion is not one that compilers are designed for: you say you want "fewest cycles" because you think that == lowest power. However, I would argue that "fastest execution" == "shortest time before the processor can go into standby mode again". Unless you believe that the power consumption of the processor in "awake" mode changes significantly with the instructions executed, I think it is safe to say that fastest execution == shortest time awake == lowest power consumption.
In which case "fat code" doesn't matter - it's back to speed only. Note also that not all instructions take the same number of clock cycles (although to be fair, that depends on the processor).
EDIT, okay that was fun...
Folks that make the blanket statement that compilers outperform humans are the ones that have not actually checked. Anything a compiler can create, a human can create. But a compiler cannot always create the code a human can create. It is that simple. For projects anywhere from a few lines to a few dozen lines or larger, it becomes easier and easier to hand-fix the optimizations made by a compiler. Compiler and target help close that gap, but there will always be the educated someone who will be able to meet or exceed the compiler's output.
The claim I would like to prove or disprove is that compilers generate code that is 2-4X "fatter" (and therefore consumes 2-4X as much power) compared with hand-written assembly code.
Unless you are defining "fatter" to mean "uses that much more power", it doesn't follow: the size of a binary and its power consumption are not related. If this whole question/project is about power consumption, the compiler won't take into account the BIOS settings you have chosen (assuming you are talking about PCs), the video card, hard disk, monitor, mouse, keyboard, etc., etc., in addition to the processor, which is only one (relatively small) part of the equation. And even if it did, nobody builds a compiler tuned just to make your code efficient on your system; they can't and won't tune the compiler for every system on the planet. Ain't gonna happen.
If you are talking about a mobile phone, which is a very controlled environment, the app may get tuned to save power, but the compiler is not the master of that; it does part of the job, and the rest is hand-tuned by the programmer.
I can compile the above snippet with -O3 and read the assembly, but I couldn't write it myself.
If you go into this with that kind of attitude then you have automatically failed. Yes, you can meet or beat the compiler, period. It is a matter of confidence, willpower, and time/effort. That statement means you have not really studied the problem, which is why you are asking the question you are asking. Take some time, do some more research, ask detailed questions at stackoverflow (not open-ended ones like this one), and with time you will understand what compilers do and don't do and why, and in particular why they are not perfect (by any of the many yardsticks such an opinion can be measured with). This question is wholly about opinion, will spark flame wars, and will get closed and eventually removed from this site. Instead, write and compile and publish code segments and ask questions about "why did the compiler produce this output, why didn't it do [this] instead?" Those kinds of questions have a better chance at real answers and of staying here for others to learn from.

Profiling floating point usage in C

Is there an easy way to count the number of multiplications actually executed by a piece of standard C code? The code I have in mind basically just does additions and multiplications, and it's the multiplications that are of primary interest, but it wouldn't hurt to get counts of the other operations as well.
If it were an option, I suppose I could go around replacing 'a * b' with 'multiply(a, b)' and write a cover function for the native * operator, because I really don't care about time performance during this test, but the primary objection to doing that is having to re-work a pile of source code just to run the test.
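For what it's worth, a minimal sketch of that wrapper idea might look like this (multiply() and mul_count are made-up names, not anything standard):

#include <stdio.h>

static unsigned long mul_count = 0;

static double multiply(double a, double b)
{
    mul_count++;        /* count the operation...         */
    return a * b;       /* ...then do the actual multiply */
}

int main(void)
{
    double x = 0.0;
    for (int i = 1; i <= 10; i++)
        x += multiply((double)i, 0.5);
    printf("result = %f after %lu multiplications\n", x, mul_count);
    return 0;
}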
I have no objection to re-compiling the source, perhaps against some library or with obscure (afaik) options. Valgrind came to mind, but if I understand valgrind's purpose, that's more about tracing values than counting operations.
Compile the source code into assembly language and then search for the multiply instructions.
Note that the optimization level can greatly affect the number that appear. For loops, you would have to determine the scope of multiplies within a loop and factor that into the result, but if the code is fairly constrained or limited in extent, that should be straightforward.
Note: a shameless extrapolation of my comment for as much rep as I can skim.
PAPI has two high-level API functions called PAPI_flips and PAPI_flops which can be used to record the FLOPS as well as the number of floating point operations. Additionally, PAPI offers lots of other performance counter monitoring capability, depending on your processor architecture... cache, bus, memory, branches, etc. I think there is support or support is emerging for graphics accelerators and CUDA/GPGPU.
PAPI will need to be installed on your system, but I think it's widespread enough that installation wouldn't be too painful, if you know what you're doing.
The nice thing about PAPI is that you don't need to know anything about the code; just instrument it (the interface is the same as a stopwatch for FLOPS) and run it. It's based on the actual dynamic execution of your program, so it takes into account things that are hard to account for analytically, such as (pseudo-)random behavior, user/variable input, and related branches.
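A minimal sketch of that approach, assuming the classic PAPI high-level API (PAPI_flops in papi.h; newer PAPI releases may have reworked or deprecated this interface, so check the version you install), where do_work() is just a dummy workload standing in for the code being measured:

#include <stdio.h>
#include <papi.h>

static volatile double sink;

static void do_work(void)               /* stand-in for the code under test */
{
    double s = 0.0;
    for (long i = 0; i < 1000000; i++)
        s += i * 0.5;
    sink = s;
}

int main(void)
{
    float rtime, ptime, mflops;
    long long flpops;

    /* first call starts the counters... */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;

    do_work();

    /* ...second call reads them */
    if (PAPI_flops(&rtime, &ptime, &flpops, &mflops) != PAPI_OK)
        return 1;

    printf("floating-point ops: %lld, MFLOPS: %f\n", flpops, mflops);
    return 0;
}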
If your compiler supports soft-float (i.e. using functions with integer implementations to emulate floating point), you could compile your program in that mode (-msoft-float in GCC) and use your favorite profiling tool to measure how many times those functions are invoked.
Many processors also have performance counters that can count the number of floating-point operations that have been retired. Depending on the hardware and OS, you may or may not need some amount of kernel support to take advantage of them.
The best that I can think of is (assuming you're running gdb):
If you could identify the points where multiplications are occurring, you could then set tracepoints just prior to the multiplication (or perhaps just after them, depending on the details), then run the program and count the number of tracepoint dumps.
Yes, it is very crude. Certainly there are other solutions; however, I would hesitate to trash my stack for something as simple as a count.

Function to purposefully have a high CPU load?

I'm making a program which controls other processes (as far as stopping and starting them goes).
One of the criteria is that the load on the computer is under a certain value.
So I need a function or small program which will cause the load to be very high for testing purposes. Can you think of anything?
Thanks.
I can think of this one:
for(;;);
If you want to actually generate peak load on a CPU, you typically want a modest-size (so that the working set fits entirely in cache) trivially parallelizable task, that someone has hand-optimized to use the vector unit on a processor. Common choices are things like FFTs, matrix multiplications, and basic operations on mathematical vectors.
These almost always generate much higher power and compute load than do more number-theoretic tasks like primality testing because they are essentially branch-free (other than loops), and are extremely homogeneous, so they can be designed to use the full compute bandwidth of a machine essentially all the time.
The exact function that should be used to generate a true maximum load varies quite a bit with the exact details of the processor micro-architecture (different machines have different load/store bandwidth in proportion to the number and width of multiply and add functional units), but numerical software libraries (and signal processing libraries) are great things to start with. See if any that have been hand tuned for your platform are available.
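As a crude illustration of the idea (plain C, not hand-tuned for any particular vector unit, so it will not reach true peak load, but it keeps one core fully busy; run one instance per core for whole-machine load):

#define N 64

double a[N][N], b[N][N], c[N][N];   /* external linkage so the work isn't optimized away */

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = i + j;
            b[i][j] = i - j;
        }

    for (;;)                                    /* spin until killed */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
}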
If you need to control how long the CPU burst will be, you can use something like the Sieve of Eratosthenes (an algorithm to find the primes up to a certain number) and supply a smallish integer (10000) for short bursts and a big integer (100000000) for long bursts.
If you want I/O to be part of the load, you can write to a file on each test in the sieve.
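A rough sketch of that, with the limit taken from the command line so the burst length is easy to control:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long limit = (argc > 1) ? atol(argv[1]) : 10000;   /* small default = short burst */
    if (limit < 2)
        return 1;
    char *composite = calloc(limit + 1, 1);
    if (composite == NULL)
        return 1;

    long count = 0;
    for (long i = 2; i <= limit; i++) {
        if (composite[i])
            continue;
        count++;                                  /* i is prime */
        for (long j = 2 * i; j <= limit; j += i)
            composite[j] = 1;                     /* mark multiples of i */
    }
    printf("%ld primes up to %ld\n", count, limit);
    free(composite);
    return 0;
}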

How to do good benchmarking of complex functions?

I am about to embark on very detailed benchmarking of a set of complex functions in C. This is "science level" detail. I'm wondering, what would be the best way to do serious benchmarking? I was thinking about running them, say, 10 times each, averaging the timing results, and giving the standard deviation, for instance, just using <time.h>. What would you guys do to obtain good benchmarks?
Reporting an average and standard deviation gives a good description of a distribution when the distribution in question is approximately normal. However, this is rarely true of computational performance measurements. Instead, performance measurements tend to more closely resemble a Poisson distribution. This makes sense, because not many random events on a computer will cause a program to go faster; essentially all of the measurement noise is in how many random events occur that cause it to slow down. (A normal distribution, by contrast, makes no intuitive sense at all; it would require the belief that a program has a non-zero probability of finishing in negative time).
In light of this, I find it most useful to report the minimum time over many runs of a program, rather than the average; the noise in the distribution is typically noise of the measuring system, rather than meaningful information about the algorithm. For complex algorithms that have early out conditions, and other shortcuts, you need to be a little more careful, but the minimum of many runs where each run handles a representative balance of inputs usually works well.
"10 times each" sounds like very few iterations to me. I generally do something on the order of thousands (or more, depending on the function/system) of runs unless that's completely infeasible. At a bare minimum, you need to make sure that you run the timing for sufficiently long as to shake out any dependence on system state, some of which may change at fairly large time granularity.
The other thing you should be aware of is that essentially every system has a platform-specific timer available that is much more accurate than what is available in <time.h>. Find out what it is on your target platform[s] and use it instead.
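A minimal sketch of the minimum-of-many-runs idea, using the POSIX clock_gettime(CLOCK_MONOTONIC) timer (substitute whatever high-resolution timer your platform provides; benchmarked_function() is just a dummy workload standing in for the code under test):

#include <stdio.h>
#include <time.h>

static volatile double sink;

static void benchmarked_function(void)     /* replace with the code under test */
{
    double s = 0.0;
    for (int i = 0; i < 1000; i++)
        s += i * 0.5;
    sink = s;
}

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    const int runs = 10000;
    double best = 1e300;

    for (int i = 0; i < runs; i++) {
        double t0 = now_seconds();
        benchmarked_function();
        double dt = now_seconds() - t0;
        if (dt < best)
            best = dt;                     /* keep the minimum, not the mean */
    }
    printf("best of %d runs: %.9f s\n", runs, best);
    return 0;
}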
I am assuming you are looking at benchmarking pure algorithmic computation in your program, and that there is no user input or output that can take unpredictable time.
Now, for purely number-crunching programs, your results could vary based on the time your program actually runs, which will be impacted by other ongoing activity in the system. There could be other factors which you may choose to ignore depending upon the level of accuracy desired, i.e. the impact of cache misses and of different access times through the memory hierarchy.
One of the methods is, as you suggested, to calculate the average over a number of runs.
Or you could look at the assembly code and see the instructions generated, and then, based on the processor, get the cycle count for these instructions. This method may not be practical depending on the amount of code you are looking to benchmark. If you are particular about the memory-hierarchy impact then you may want to control the execution environment very carefully, i.e. where the program is loaded, where its data is loaded, etc. But as I mentioned, depending on the accuracy desired, you may absorb the variation caused by the memory hierarchy into your statistical variation.
You may need to carefully design the test input for your functions to ensure path coverage, and may choose to publish performance statistics as a function of test input. This will show how the function behaves across the range of inputs.

How to calculate MIPS for an algorithm for ARM processor

I have recently been asked to produce a MIPS (millions of instructions per second) figure for an algorithm we have developed. The algorithm is exposed by a set of C-style functions. We have exercised the code on a Dell Axim to benchmark the performance under different inputs.
This question came from our hardware vendor, but I am mostly a HL software developer so I am not sure how to respond to the request. Maybe someone with similar HW/SW background can help...
1 Since our algorithm is not real time, I don't think we need to quantify it as MIPS. Is it possible to simply quote the total number of assembly instructions?
2 If 1 is true, how do you do this (i.e. how to measure the number of assembly instructions), either in general or specifically for ARM/XScale?
3 Can 2 be performed on a WM device or via the Device Emulator provided in VS2005?
4 Can 3 be automated?
Thanks a lot for your help.
Charles
Thanks for all your help. I think S. Lott hit the nail on the head. And as a follow-up, I now have more questions.
5 Any suggestion on how to go about measuring MIPS? I heard someone suggest running our algorithm and comparing it against a Dhrystone/Whetstone benchmark to calculate MIPS.
6 Since the algorithm does not need to be run in real time, is MIPS really a useful measure? (e.g. factorial(N)) What are other ways to quantify the processing requirements? (I have already measured the runtime performance but it was not a satisfactory answer.)
7 Finally, I assume MIPS is a crude estimate and would depend on the compiler, optimization settings, etc.?
I'll bet that your hardware vendor is asking how many MIPS you need.
As in "Do you need a 1,000 MIPS processor or a 2,000 MIPS processor?"
Which gets translated by management into "How many MIPS?"
Hardware offers MIPS. Software consumes MIPS.
You have two degrees of freedom.
The processor's inherent MIPS offering.
The number of seconds during which you consume that many MIPS.
If the processor doesn't have enough MIPS, your algorithm will be "slow".
If the processor has enough MIPS, your algorithm will be "fast".
I put "fast" and "slow" in quotes because you need to have a performance requirement to determine "fast enough to meet the performance requirement" or "too slow to meet the performance requirement."
On a 2,000 MIPS processor, you might take an acceptable 2 seconds. But on a 1,000 MIPS processor this explodes to an unacceptable 4 seconds.
How many MIPS do you need?
Get the official MIPS for your processor. See http://en.wikipedia.org/wiki/Instructions_per_second
Run your algorithm on some data.
Measure the exact run time. Average a bunch of samples to reduce uncertainty.
Report. 3 seconds on a 750 MIPS processor is -- well -- 3 seconds at 750 MIPS. MIPS is a rate. Time is time. Distance is the product of rate * time. 3 seconds at 750 MIPS is 750*3 million instructions.
Remember Rate (in Instructions per second) * Time (in seconds) gives you Instructions.
Don't say that it's 3*750 MIPS. It isn't; it's 2250 Million Instructions.
Some notes:
MIPS is often used as a general "capacity" measure for processors, especially in the soft real-time/embedded field where you do want to ensure that you do not overload a processor with work. Note that this IS instructions per second, as the time is very important!
MIPS used in this fashion is quite unscientific.
MIPS used in this fashion is still often the best approximation there is for sizing a system and determining the speed of the processor. It might well be off by 25%, but never mind...
Counting MIPS requires a processor that is close to what you are using. The right instruction set is obviously crucial, to capture the actual instruction stream from the actual compiler in use.
You cannot in any way approximate this on a PC. You need to bring out one of a few tools to do this right:
Use an instruction-set simulator for the target architecture such as Qemu, ARM's own tools, Synopsys, CoWare, Virtutech, or VaST. These are fast but can count instructions pretty well, and will support the right instruction set. Barring extensive use of expensive instructions like integer divide (and please no floating point), these numbers tend to be usefully close.
Find a clock-cycle-accurate simulator for your target processor (or something close to it), which will give a pretty good estimate of pipeline effects etc. Once again, get it from ARM or from Carbon SoCDesigner.
Get a development board for the processor family you are targeting, or an ARM design close to it, and profile the application there. You wouldn't use an ARM9 to profile for an ARM11, but an ARM11 might be a good approximation for an ARM Cortex-A8/A9, for example.
MIPS is generally used to measure the capability of a processor.
Algorithms usually take either:
a certain amount of time (when running on a certain processor)
a certain number of instructions (depending on the architecture)
Describing an algorithm in terms of instructions per second would seem like a strange measure, but of course I don't know what your algorithm does.
To come up with a meaningful measure, I would suggest that you set up a test which allows you to measure the average time taken for your algorithm to complete. The number of assembly instructions would be a reasonable measure, but it can be difficult to count them! Your best bet is something like this (sketched in C; runAlgorithm and randomData stand in for your function and test input):
#include <time.h>

const int num_trials = 1000000;
clock_t start_time = clock();
for (int i = 0; i < num_trials; i++) {
    runAlgorithm(randomData);   /* your function and test input */
}
double time_taken = (double)(clock() - start_time) / CLOCKS_PER_SEC;
double average_time = time_taken / num_trials;
MIPS is a measure of CPU speed, not algorithm performance. I can only assume that somewhere along the line, someone is slightly confused. What are they trying to find out? The only likely scenario I can think of is that they're trying to help you determine how fast a processor they need to give you to run your program satisfactorily.
Since you can measure an algorithm in number of instructions (which is no doubt going to depend on the input data, so this is non-trivial), you then need some measure of time in order to get MIPS -- for instance, say "I need to invoke it 1000 times per second". If your algorithm is 1000 instructions for that particular case, you'll end up with:
1000 instructions / (1/1000) seconds = 1000000 instructions per second = 1 MIPS.
I still think that's a really odd way to try to do things, so you may want to ask for clarification. As for your specific questions, I'll leave that to someone more familiar with Visual Studio.
Also remember that different compilers and compiler options make a HUGE difference. The same source code can run at many different speeds. So instead of buying the 2 MIPS processor you may be able to use the 1/2 MIPS processor and the right compiler option. Or spend the money on a better compiler and use the cheaper processor.
Benchmarking is flawed at best. As a hobby I used to compile the same Dhrystone (and Whetstone) code with various compilers from various vendors for the same hardware, and the numbers were all over the place, by orders of magnitude. Same source code, same processor: Dhrystone didn't mean a thing and was not useful as a baseline. What matters in benchmarking is how fast YOUR algorithm runs; it had better be as fast as or faster than it needs to be. Depending on how close to the finish line you are, allow for plenty of slop. Early on you probably want to be running 5 or 10 or 100 times faster than you need to, so that by the end of the project you are at least slightly faster than you need to be.
I agree with what I think S. Lott is saying: this is all sales and marketing and management talk. Being the one that management has put between a rock and a hard place, what you need to do is get them to buy the fastest processor and best tools that they are willing to spend on, based on the colorful pie charts and graphs that you are going to generate from thin air as justification. If near the end of the road it doesn't quite meet performance, then you could return to stackoverflow, but at that point management will be forced to buy a different toolchain at almost any price, or swap processors and respin the board. By then you should know how close to the target you are: we need 1.0 and we are at 1.25; if we buy the processor that is twice as fast as the one we bought, we should make it.
Whether or not you can automate these kinds of things or simulate them depends on the tools; sometimes yes, sometimes no. I am not familiar with the tools you are talking about so I can't speak to them directly.
This response is not intended to answer the question directly, but to provide additional context around why this question gets asked.
MIPS for an algorithm is only relevant for algorithms that need to respond to an event within the required time.
For example, consider a controller designed to detect the wind speed and move an actuator within a second when the wind speed crosses over 25 miles/hour. Let us say it takes 1000 instructions to calculate and compare the wind speed against the threshold, and that this must complete within a second. The processing requirement for this algorithm is therefore 1000 instructions per second, i.e. 1 KIPS (kilo-instructions per second). If the controller is based on a 1 MIPS processor, we can comfortably say that there is more juice in the controller to add other functions.
What other functions could be added to the controller? That depends on the MIPS of the function/algorithm to be added. If there is another function that needs 100,000 instructions to be performed within a second (i.e. 100 KIPS), we can still accommodate this new function and still have some room for others.
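The budgeting arithmetic behind that, using the hypothetical numbers from the example:

#include <stdio.h>

int main(void)
{
    double processor_mips    = 1.0;     /* controller capacity  */
    double wind_check_mips   = 0.001;   /* 1 KIPS  = 0.001 MIPS */
    double new_function_mips = 0.1;     /* 100 KIPS = 0.1 MIPS  */

    double headroom = processor_mips - wind_check_mips - new_function_mips;
    printf("remaining capacity: %.3f MIPS\n", headroom);   /* prints 0.899 */
    return 0;
}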
For a first estimate a benchmark on the PC may be useful.
However, before you commit to a specific device and clock frequency you should get a developer board (or some PDA?) for the ARM target architecture and benchmark it there.
There are a lot of factors influencing the speed on today's machines (caching, pipelines, different instruction sets, ...) so your benchmarks on a PC may be way off w.r.t. the ARM.
