Which code consumes less power? - C

My goal is to develop and implement a green algorithm for a particular situation. I have developed two algorithms for it.
One has a large number of memory accesses (loads and stores). The access pattern is sometimes coalesced and sometimes non-coalesced. I am assuming the worst case, where most accesses result in cache misses. See sample code snippet a).
The other has a large number of calculations, roughly equivalent to code snippet b) below.
How do I estimate the power consumption in each case? Which one is more energy efficient and why?
Platform: I will be running these codes on an Intel i3 processor, with Windows 7, 4 GB DRAM and 3 MB cache.
Note: I do not want to use any external power meter. Also, please ignore it if you find that the code does not do anything constructive; it is only a fraction of the complete algorithm.
UPDATE:
It is difficult but not impossible. One can very well calculate the cost incurred in reading DRAM and doing multiplications in the CPU's ALU. The only thing is that one must have the required knowledge of the electronics of DRAM and the CPU, which I am lacking at this point of time. At least for the worst case I think this can be established. Worst case means no coalesced accesses and no compiler optimization.
If you can estimate the cost of accessing DRAM and doing a float multiplication, then why is it impossible to estimate the current, and hence get a rough idea of the power, during these operations? Also see my post: I am not asking how much power is consumed; rather, I am asking which code consumes less/more power, or which one is more energy efficient.
a) for(i=0; i<1000000; i++)
   {
       a[i] = b[i];   // a, b are float arrays in RAM
   }
b) float j = 1.0f;    // placeholder initial value; j is used later in the program, not shown here
   for(i=1; i<1000000; i++)
   {
       j = j * i;
   }

To measure the actual power consumption you should add an electricity meter to your power supply (remove the batteries if using a notebook).
Note that you will measure the power consumption of the entire system, so make sure to avoid nuisance parameters (any other system activity, e.g. anti-virus updates, the graphical desktop environment, indexing services, (internal) hardware devices), and perform measurements repeatedly, with and without your algorithms running, to cancel out the "background" consumption.
If possible use an embedded system.
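For the bookkeeping itself, here is a rough sketch (added for illustration; the wattages below are made-up placeholders, not measurements): subtract the idle reading from the loaded reading and multiply by the run time to get the energy attributable to the workload.
#include <stdio.h>

int main(void)
{
    double p_idle_w   = 45.0;   /* measured system power at idle, watts (assumed placeholder) */
    double p_loaded_w = 78.0;   /* measured system power with the algorithm running (assumed placeholder) */
    double t_run_s    = 120.0;  /* measured run time of the algorithm, seconds (assumed placeholder) */

    double energy_j = (p_loaded_w - p_idle_w) * t_run_s;  /* joules attributable to the workload */
    printf("Approximate energy used by the workload: %.1f J\n", energy_j);
    return 0;
}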
Concerning your algorithms, the actual energy efficiency depends not only on the C code but also on the performance of the compiler and on the runtime behavior in interaction with the surrounding system. However, here are some resources on what you can do as a developer to help with this:
Energy-Efficient Software Checklist
Energy-Efficient Software Guidelines
Especially take a look at the paragraph Tools in the above "Checklist", as it lists some tools that may help you with rough estimates (based on application profiling). Among others, it lists:
Perfmon
PwrTest/Windows Driver Kit
Windows Event Viewer (Timer tick change events, Microsoft-Windows-Kernel-PowerDiagnostic log)
Intel PowerInformer
Windows ETW (performance monitoring framework)
Intel Application Energy Toolkit

As commenters pointed out, try using a power meter. Estimating power usage, even from raw assembly code, is difficult if not impossible on modern superscalar architectures.

As long as only the CPU and memory are involved, you can assume power consumption is proportional to runtime.
This may not be 100% accurate, but it is as near as you can get without an actual measurement.
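A minimal sketch of how you might apply that (added here; it simply times the two loops from the question with clock() and assumes runtime tracks energy; increase N or repeat the loops if the measured times are too small to resolve):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1000000

int main(void)
{
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    for (int i = 0; i < N; i++) b[i] = (float)i;

    clock_t t0 = clock();
    for (int i = 0; i < N; i++)      /* snippet a): memory bound */
        a[i] = b[i];
    clock_t t1 = clock();

    float j = 1.0f;                  /* placeholder initial value, as in snippet b) */
    for (int i = 1; i < N; i++)      /* snippet b): compute bound (overflows to inf, fine for timing) */
        j = j * i;
    clock_t t2 = clock();

    /* print results so the compiler cannot discard either loop */
    printf("copy loop:     %.3f s (a[last]=%g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, a[N - 1]);
    printf("multiply loop: %.3f s (j=%g)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, j);

    free(a); free(b);
    return 0;
}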

You can try using some CPU monitoring tools to see which algorithm heats up your CPU more. It won't give you solid data, but it will show whether there is a significant difference between the two algorithms in terms of power consumption.
Here I assume that the major power consumer is the CPU and that the algorithms do not require heavy I/O.

Well, I am done with my preliminary research and discussions with electronics majors.
A rough idea can be obtained by considering two factors:
1- Currents involved: more current, more power dissipation.
2- Power dissipation due to clock rate: dynamic power grows with frequency, and faster than linearly once the supply voltage is raised along with the clock (a first-order model is sketched below).
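For reference, the standard first-order model for dynamic CMOS power is P = a * C * V^2 * f. Here is a tiny illustrative sketch of it (added here; the activity factor, capacitance and voltage values are made-up placeholders, not i3 or DRAM datasheet figures):
#include <stdio.h>

/* First-order dynamic power model: activity * capacitance * voltage^2 * frequency. */
static double dyn_power(double activity, double cap_f, double volts, double freq_hz)
{
    return activity * cap_f * volts * volts * freq_hz;
}

int main(void)
{
    double p_cpu = dyn_power(0.5, 1e-9, 1.2, 3.0e9);    /* CPU-like clock domain (placeholder values) */
    double p_mem = dyn_power(0.5, 1e-9, 1.5, 133.0e6);  /* memory-bus-like clock domain (placeholder values) */
    printf("CPU-domain estimate:    %.2f W\n", p_cpu);
    printf("Memory-domain estimate: %.2f W\n", p_mem);
    return 0;
}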
In snippet a) the DRAM and the memory subsystem draw comparatively little current, so the power dissipation will be very small during each
a[i]= b[i];
operation, which is nothing but a data read and a write.
Also, the memory clock is usually much lower than the CPU clock. While the CPU is clocked at 3 GHz, the memory is clocked at about 133 MHz or so (not all components run at their rated clock). Thus power dissipation is lower because of the lower clock.
In snippet b) I am doing more calculations, which involves more power dissipation because of the much higher clock frequency.
Another factor is that the multiplication itself takes several times more cycles than a data read/write (provided the memory access is coalesced).
Also, it would be great to have an option for measuring, or getting a rough idea of, the power dissipation ("code energy") of a piece of code, with a color representing how energy efficient the code is: red being very poor and green being highly energy efficient.
In short, given today's technologies it is not very difficult for software to estimate power like this (possibly taking many other parameters into account, in addition to the ones described above). This would be useful for faster development and evaluation of green algorithms.

Related

Should I use a GPU?

How can I know whether my serial code will run faster if I use a GPU? I know it depends on a lot of things, e.g. whether the code can be parallelized in a SIMD fashion and so on, but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? Should I therefore not bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time trying to code my algorithm for the GPU unless I am 100% sure that speed will be gained... that is my problem...
I think that my algorithm could be parallelized on a GPU... would it be worth trying?
It depends upon two factors:
1) The speedup from having many cores performing the floating point operations
This depends upon the inherent parallelism of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This depends mainly on the "memory bandwidth" of your particular GPU, and is greatly reduced by the Sandy Bridge architecture, where the CPU and GPU are on the same die. With older architectures, some operations such as matrix multiplications where the inner dimensions are small get no improvement, because it takes longer to transfer the inner vectors back and forth across the system bus than it does to dot-product the vectors on the CPU.
Unfortunately these two factors are tough to estimate, and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute CUBLAS, which has the same API except that it sends the operations over to the GPU to perform.
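As a back-of-the-envelope way to reason about factor 2 (a sketch added here; every number in it is an assumed placeholder, not a measured value), compare the time to move the data across the bus with the time the GPU would need for the computation:
#include <stdio.h>

int main(void)
{
    double bytes_to_transfer = 512.0 * 1024 * 1024;  /* data moved to and from the GPU (assumed) */
    double pcie_bw_bps       = 6.0e9;                /* effective bus bandwidth, bytes/s (assumed) */
    double flops_needed      = 2.0e11;               /* floating point work in the kernel (assumed) */
    double gpu_flops_per_s   = 5.0e11;               /* sustained GPU throughput (assumed) */
    double cpu_flops_per_s   = 5.0e10;               /* sustained CPU throughput (assumed) */

    double t_gpu = bytes_to_transfer / pcie_bw_bps + flops_needed / gpu_flops_per_s;
    double t_cpu = flops_needed / cpu_flops_per_s;

    printf("GPU (transfer + compute): %.3f s\n", t_gpu);
    printf("CPU (compute only):       %.3f s\n", t_cpu);
    printf("GPU offload %s\n", t_gpu < t_cpu ? "looks worthwhile" : "may not pay off");
    return 0;
}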
When looking for a parallel solution you should typically consider:
The amount of data you have.
The amount of floating point computation you have.
How complicated your algorithm is, i.e. the conditions and branches in it. Is there any data locality?
What kind of speedup is required?
Is it a real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware you have access to.
Depending on the answers, you may want to use GPGPU, cluster computation, distributed computation, or a combination of GPU and cluster/distributed machines.
If you could share any information on your algorithm and the size of the data, it would be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.

Function to purposefully create a high CPU load?

I'm making a program which controls other processes (stopping and starting them).
One of the criteria is that the load on the computer must be under a certain value.
So I need a function or small program which will cause the load to be very high for testing purposes. Can you think of anything?
Thanks.
I can think of this one:
for(;;);
If you want to actually generate peak load on a CPU, you typically want a modest-sized task (so that the working set fits entirely in cache) that is trivially parallelizable and that someone has hand-optimized to use the vector unit of the processor. Common choices are things like FFTs, matrix multiplications, and basic operations on mathematical vectors.
These almost always generate much higher power and compute load than do more number-theoretic tasks like primality testing because they are essentially branch-free (other than loops), and are extremely homogeneous, so they can be designed to use the full compute bandwidth of a machine essentially all the time.
The exact function that should be used to generate a true maximum load varies quite a bit with the exact details of the processor micro-architecture (different machines have different load/store bandwidth in proportion to the number and width of multiply and add functional units), but numerical software libraries (and signal processing libraries) are great things to start with. See if any that have been hand tuned for your platform are available.
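In that spirit, here is a minimal portable sketch (added here, not hand-tuned for any particular vector unit). It keeps one core busy with homogeneous, branch-free multiply-add work; a real peak-load generator would use an optimized FFT or GEMM, and one instance per core (or threads) to load the whole CPU.
/* Compile with optimizations, e.g. gcc -O2 busy.c, and run one instance per core. */
#include <stdio.h>

int main(void)
{
    double a = 1.0000001, b = 0.9999999, acc = 0.0;

    for (long long i = 0; i < 2000000000LL; i++)
        acc = acc * a + b;          /* homogeneous, branch-free multiply-add chain */

    printf("done (%g)\n", acc);     /* print the result so the loop is not optimized away */
    return 0;
}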
If you need to control how long the CPU burst will be, you can use something like the Sieve of Eratosthenes (an algorithm to find the primes up to a certain number) and supply a smallish integer (10000) for short bursts and a big integer (100000000) for long bursts.
If you want to take I/O into account for the load, you can write to a file for each test in the sieve.
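A rough sketch of that approach (added here; the default limit is just an example): a plain Sieve of Eratosthenes whose command-line argument controls how long the burst lasts.
#include <stdio.h>
#include <stdlib.h>

/* Busy the CPU for a controllable amount of time by sieving primes up to `limit`. */
static size_t sieve(size_t limit)
{
    char *composite = calloc(limit + 1, 1);
    size_t count = 0;
    if (!composite) return 0;

    for (size_t p = 2; p <= limit; p++) {
        if (composite[p]) continue;
        count++;
        for (size_t m = p * p; m <= limit; m += p)
            composite[m] = 1;
    }
    free(composite);
    return count;
}

int main(int argc, char **argv)
{
    size_t limit = (argc > 1) ? strtoull(argv[1], NULL, 10) : 10000;  /* e.g. 10000 or 100000000 */
    printf("%zu primes below %zu\n", sieve(limit), limit);
    return 0;
}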

Can we benchmark how fast CUDA or OpenCL is compared to CPU performance?

How much faster can an algorithm in CUDA or OpenCL run compared to a single general-purpose processor core? (assuming the algorithm is written and optimized for both the CPU and GPU targets)
I know it depends on both the graphics card and the CPU, but say, one of the fastest GPUs from NVIDIA and a (single core of an) Intel i7 processor?
And I know it also depends on the type of algorithm.
I do not need a strict answer, but experienced examples like: for an image manipulation algorithm using double-precision floating point and 10 operations per pixel, it previously took 5 minutes and now runs in x seconds using this hardware.
Your question is overly broad and very difficult to answer. Moreover, only a small percentage of algorithms (those without much shared state) are feasible on GPUs.
But I do want to urge you to be critical about claims. I'm in image processing, and have read many an article on the subject, but quite often in the GPU case the time to upload the input data to the GPU and download the results back to main memory is not included in the calculation of the speedup factor.
While there are a few cases where this doesn't matter (both are small, or there is a second-stage calculation that further reduces the result in size), usually one does have to transfer the results and the initial data.
I've seen this turn a claimed plus into a negative, because the upload/download time alone was longer than what the main CPU would require to do the calculation.
Pretty much the same thing applies to combining the results of different GPU cards.
Update: Newer GPUs seem to be able to upload/download and calculate at the same time using ping-pong buffers. But the advice to check the border conditions thoroughly still stands. There is a lot of spin out there.
Update 2: Quite often, using a GPU that is shared with video output for this is not optimal. Consider e.g. adding a low-budget card for video and using the onboard video for GPGPU tasks.
I think that this video introduction to OpenCL gives a good answer to your question in the first or second episode (I do not remember which). I think it was at the end of the first episode...
In general it depends on how well you can parallelize the problem. The problem size itself is also a factor, because it costs time to copy the data to the graphics card.
It depends very much on the algorithm and how efficient the implementation can be.
Overall, it's fair to say that GPUs are better at computation than CPUs. Thus, an upper bound is to divide the theoretical GFLOPS rating of a top-end GPU by that of a top-end CPU. You can do a similar computation for theoretical memory bandwidth.
For example, 1581.1 GFLOPS for a GTX580 vs. 107.55 GFLOPS for an i7 980XE. Note that the rating for the GTX580 is for single precision. I believe you need to cut that down by a factor of 4 for Fermi-class non-Tesla cards to get to the double-precision rating. So in this instance, you might expect roughly 4x.
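Spelling out that arithmetic: 1581.1 / 4 ≈ 395 double-precision GFLOPS for the GTX580, and 395 / 107.55 ≈ 3.7, hence the "roughly 4x" upper bound.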
Caveats on why you might do better (or see results which claim far bigger speedups):
GPUs have better memory bandwidth than CPUs once the data is on the card. Sometimes, memory-bound algorithms can do well on the GPU.
Clever use of caches (texture memory etc.) can let you do better than the advertised bandwidth.
Like Marco says, the transfer time often doesn't get included. I personally always include such time in my work, and have thus found that the biggest speedups I've seen are in iterative algorithms where all the data fits on the GPU (I've personally gotten over 300x going from a midrange CPU to a midrange GPU there).
Apples-to-oranges comparisons. Comparing a top-end GPU vs. a low-end CPU is inherently unfair. The rebuttal is that a high-end CPU costs much more than a high-end GPU. Once you go to a GFLOPS/$ or GFLOPS/Watt comparison, it can look much more favorable to the GPU.
__kernel void vecAdd(__global float* results)
{
    // intentionally trivial kernel: each work-item only fetches its global id
    int id = get_global_id(0);
}
This kernel code can spawn 16M threads on a new $60 R7-240 GPU in 10 milliseconds.
This is equivalent to 16 thread creations or context switches every 10 nanoseconds. What is the timing for a $140 FX-8150 8-core CPU? It is 1 thread per 50 nanoseconds per core.
Every instruction added to this kernel is a win for the GPU, until it introduces branching.
Your question is, in general, hard to answer; there are simply many different variables that make it hard to give answers that are either accurate or fair.
Notably, you are comparing 1) choice of algorithm, 2) relative performance of hardware, 3) compiler optimisation ability, 4) choice of implementation languages and 5) efficiency of the algorithm implementation, all at the same time...
Note that, for example, different algorithms may be preferable on a GPU vs. a CPU, and data transfers to and from the GPU need to be accounted for in the timings, too.
AMD has a case study (several, actually) in OpenCL performance for OpenCL code executing on the CPU and on the GPU. Here is one with performance results for sparse matrix vector multiply.
I've seen figures ranging from 2x to 400x. I also know that middle-range GPUs cannot compete with high-range CPUs in double-precision computation - MKL on an 8-core Xeon will be faster than CULA or CUBLAS on a $300 GPU.
OpenCL is anecdotally much slower than CUDA.
A new benchmark suite called SHOC (Scalable Heterogeneous Computing) from Oak Ridge National Lab and Georgia Tech has both OpenCL and CUDA implementations of many important kernels. You can download the suite from http://bit.ly/shocmarx. Enjoy.

Estimate Power Consumption Based on Running Time Analysis / Code Size

I've developed and tested a C program on my PC and now I want to give an estimate of the power consumption required for the program to do a single run. I've analysed the running time of the application and of individual function calls within the application, and I know the code size both in assembly lines and in raw C lines.
How would I give an estimate of the power consumption based on the performance analysis and/or code size? I suppose it scales with the number of lines that use the CPU for computation or do memory accesses, but I was hoping for a more precise answer.
Also, how would I tell the difference between the power consumption on, say, my PC compared to on a Microchip device?
Good luck. What you want to do is pretty much impossible on a desktop PC. The best you could probably do would be to measure the from-the-wall power draw at idle and when running your program, with as few other programs as possible running at the same time. Average the results over 100 or so runs, and you should have a value with an accuracy of a few percent (standard statistical disclaimers apply).
On a Microchip device, it should be easier to calculate the power consumption, since they publish (average) power consumption values for the various modes, and the timing is deterministic. Unfortunately, there are so many differences between a processor like that and your desktop processor (word size, pipelining, multiple issue, multiple processes, etc.) that there really won't be any effective way to compare the two.
There is a paper on Intel's website that gives the average energy per instruction for various processors. They give 11 nJ per instruction for the Core Duo, for example. How useful that will be for you depends on how much your code looks like the SPECint benchmark, I guess.
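Using that figure as an example (a sketch added here; the instruction count is an assumed placeholder, and 11 nJ/instruction is the Core Duo number quoted above, so it will not match other CPUs), an instruction count can be turned into a crude energy estimate:
#include <stdio.h>

int main(void)
{
    double instructions       = 5.0e9;  /* instructions retired by one run (assumed, e.g. from a profiler) */
    double nj_per_instruction = 11.0;   /* Core Duo average from the Intel paper */

    double energy_j = instructions * nj_per_instruction * 1e-9;
    printf("Rough CPU energy for one run: %.2f J\n", energy_j);
    return 0;
}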

How to calculate MIPS for an algorithm for an ARM processor

I have recently been asked to produce the MIPS (millions of instructions per second) figure for an algorithm we have developed. The algorithm is exposed by a set of C-style functions. We have exercised the code on a Dell Axim to benchmark the performance under different inputs.
This question came from our hardware vendor, but I am mostly a high-level software developer so I am not sure how to respond to the request. Maybe someone with a similar HW/SW background can help...
1. Since our algorithm is not real time, I don't think we need to quantify it as MIPS. Is it possible to simply quote the total number of assembly instructions?
2. If 1 is true, how do you do this (i.e. how do you measure the number of assembly instructions), either in general or specifically for ARM/XScale?
3. Can 2 be performed on a WM device or via the Device Emulator provided in VS2005?
4. Can 3 be automated?
Thanks a lot for your help.
Charles
Thanks for all your help. I think S.Lott hit the nail on the head. And as a follow-up, I now have more questions.
5. Any suggestions on how to go about measuring MIPS? I heard someone suggest running our algorithm and comparing it against the Dhrystone/Whetstone benchmark to calculate MIPS.
6. Since the algorithm does not need to run in real time, is MIPS really a useful measure? (e.g. factorial(N)) What are other ways to quantify the processing requirements? (I have already measured the runtime performance but it was not a satisfactory answer.)
7. Finally, I assume MIPS is a crude estimate and would be dependent on the compiler, optimization settings, etc.?
I'll bet that your hardware vendor is asking how many MIPS you need.
As in "Do you need a 1,000 MIPS processor or a 2,000 MIPS processor?"
Which gets translated by management into "How many MIPS?"
Hardware offers MIPS. Software consumes MIPS.
You have two degrees of freedom.
The processor's inherent MIPS offering.
The number of seconds during which you consume that many MIPS.
If the processor doesn't have enough MIPS, your algorithm will be "slow".
If the processor has enough MIPS, your algorithm will be "fast".
I put "fast" and "slow" in quotes because you need to have a performance requirement to determine "fast enough to meet the performance requirement" or "too slow to meet the performance requirement."
On a 2,000 MIPS processor, you might take an acceptable 2 seconds. But on a 1,000 MIPS processor this explodes to an unacceptable 4 seconds.
How many MIPS do you need?
Get the official MIPS for your processor. See http://en.wikipedia.org/wiki/Instructions_per_second
Run your algorithm on some data.
Measure the exact run time. Average a bunch of samples to reduce uncertainty.
Report. 3 seconds on a 750 MIPS processor is -- well -- 3 seconds at 750 MIPS. MIPS is a rate. Time is time. Distance is the product of rate * time. 3 seconds at 750 MIPS is 750*3 million instructions.
Remember Rate (in Instructions per second) * Time (in seconds) gives you Instructions.
Don't say that it's 3*750 MIPS. It isn't; it's 2250 Million Instructions.
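The same bookkeeping in code form (a trivial sketch added here, using the numbers from the example above):
#include <stdio.h>

int main(void)
{
    double mips    = 750.0;  /* processor rating, million instructions per second */
    double seconds = 3.0;    /* measured run time */

    double million_instructions = mips * seconds;  /* rate * time = work */
    printf("%.0f million instructions consumed\n", million_instructions);
    return 0;
}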
Some notes:
MIPS is often used as a general "capacity" measure for processors, especially in the soft real-time/embedded field where you do want to ensure that you do not overload a processor with work. Note that this IS instructions per second, as the time is very important!
MIPS used in this fashion is quite unscientific.
MIPS used in this fashion is still often the best approximation there is for sizing a system and determining the speed of the processor. It might well be off by 25%, but never mind...
Counting MIPS requires a processor that is close to what you are using. The right instruction set is obviously crucial, to capture the actual instruction stream from the actual compiler in use.
You cannot in any way approximate this on a PC. You need to bring out one of a few tools to do this right:
Use an instruction-set simulator for the target architecture such as Qemu, ARM's own tools, Synopsys, CoWare, Virtutech, or VaST. These are fast and can count instructions pretty well, and will support the right instruction set. Barring extensive use of expensive instructions like integer divide (and please no floating point), these numbers tend to be usefully close.
Find a clock-cycle-accurate simulator for your target processor (or something close), which will give a pretty good estimate of pipeline effects etc. Once again, get it from ARM or from Carbon SoCDesigner.
Get a development board for the processor family you are targeting, or one with an ARM design close to it, and profile the application there. You don't use an ARM9 to profile for an ARM11, but an ARM11 might be a good approximation for an ARM Cortex-A8/A9, for example.
MIPS is generally used to measure the capability of a processor.
Algorithms usually take either:
a certain amount of time (when running on a certain processor)
a certain number of instructions (depending on the architecture)
Describing an algorithm in terms of instructions per second would seem like a strange measure, but of course I don't know what your algorithm does.
To come up with a meaningful measure, I would suggest that you set up a test which lets you measure the average time taken for your algorithm to complete. The number of assembly instructions would be a reasonable measure, but it can be difficult to count them! Your best bet is something like this (pseudo-code):
const num_trials = 1000000
start_time = timer()
for (i = 1 to num_trials)
{
runAlgorithm(randomData)
}
time_taken = timer() - start_time
average_time = time_taken / num_trials
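A concrete C version of that pseudo-code (a sketch added here; runAlgorithm and makeRandomData are hypothetical stand-ins for your real algorithm and input generation):
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Stand-ins for the real algorithm and input generator (hypothetical names). */
static volatile int sink;
static int makeRandomData(void) { return rand(); }
static void runAlgorithm(int data) { sink = data * data; }

int main(void)
{
    const long num_trials = 1000000L;

    clock_t start = clock();
    for (long i = 0; i < num_trials; i++)
        runAlgorithm(makeRandomData());
    clock_t end = clock();

    double time_taken   = (double)(end - start) / CLOCKS_PER_SEC;
    double average_time = time_taken / num_trials;
    printf("average time per call: %.9f seconds\n", average_time);
    return 0;
}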
MIPS is a measure of CPU speed, not algorithm performance. I can only assume that somewhere along the line, someone is slightly confused. What are they trying to find out? The only likely scenario I can think of is that they're trying to help you determine how fast a processor they need to give you to run your program satisfactorily.
Since you can measure an algorithm in a number of instructions (which will no doubt depend on the input data, so this is non-trivial), you then need some measure of time in order to get MIPS -- for instance, say "I need to invoke it 1000 times per second". If your algorithm is 1000 instructions for that particular case, you'll end up with:
1000 instructions / (1/1000) seconds = 1000000 instructions per second = 1 MIPS.
I still think that's a really odd way to try to do things, so you may want to ask for clarification. As for your specific questions, I'll leave that to someone more familiar with Visual Studio.
Also remember that different compilers and compiler options make a HUGE difference. The same source code can run at many different speeds. So instead of buying the 2 MIPS processor you may be able to use the 1/2 MIPS processor and a compiler option, or spend the money on a better compiler and use the cheaper processor.
Benchmarking is flawed at best. As a hobby I used to compile the same Dhrystone (and Whetstone) code on various compilers from various vendors for the same hardware, and the numbers were all over the place, by orders of magnitude. Same source code, same processor: Dhrystone didn't mean a thing, and was not useful as a baseline. What matters in benchmarking is how fast YOUR algorithm runs; it had better be as fast as or faster than it needs to be. Depending on how close to the finish line you are, allow for plenty of slop. Early on you probably want to be running 5 or 10 or 100 times faster than you need to, so that by the end of the project you are at least slightly faster than you need to be.
I agree with what I think S. Lott is saying: this is all sales, marketing and management talk. Being the one that management has put between a rock and a hard place, what you need to do is get them to buy the fastest processor and the best tools they are willing to spend on, based on the colorful pie charts and graphs that you are going to generate from thin air as justification. If near the end of the road it doesn't quite meet performance, then you could return to Stack Overflow, but at the same time management will be forced to buy a different toolchain at almost any price or swap processors and respin the board. By then you should know how close to the target you are: we need 1.0 and we are at 1.25; if we buy the processor that is twice as fast as the one we bought, we should make it.
Whether or not you can automate these kinds of things or simulate them depends on the tools; sometimes yes, sometimes no. I am not familiar with the tools you are talking about, so I can't speak to them directly.
This response is not intended to answer the question directly, but to provide additional context around why this question gets asked.
MIPS for an algorithm is only relevant for algorithms that need to respond to an event within a required time.
For example, consider a controller designed to detect the wind speed and move an actuator within a second when the wind speed crosses 25 miles/hour. Let us say it takes 1000 instructions to calculate the wind speed and compare it against the threshold. The MIPS requirement for this algorithm is 1 kilo-instructions per second (KIPS). If the controller is based on a 1 MIPS processor, we can comfortably say that there is more juice in the controller to add other functions.
What other functions could be added to the controller? That depends on the MIPS of the function/algorithm to be added. If there is another function that needs 100,000 instructions to be performed within a second (i.e. 100 KIPS), we can still accommodate this new function and still have some room for other functions (see the sketch below).
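The headroom reasoning from that example, written out (a sketch added here, using the numbers above):
#include <stdio.h>

int main(void)
{
    double processor_mips = 1.0;    /* 1 MIPS controller */
    double wind_task_mips = 0.001;  /* 1000 instructions/second = 1 KIPS */
    double new_task_mips  = 0.1;    /* 100,000 instructions/second = 100 KIPS */

    double headroom = processor_mips - wind_task_mips - new_task_mips;
    printf("Remaining capacity: %.3f MIPS\n", headroom);
    return 0;
}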
For a first estimate, a benchmark on the PC may be useful.
However, before you commit to a specific device and clock frequency, you should get a development board (or some PDA?) for the ARM target architecture and benchmark it there.
There are a lot of factors influencing the speed of today's machines (caching, pipelines, different instruction sets, ...), so your benchmarks on a PC may be way off w.r.t. the ARM.
