Running the benchmark with a multicore simulator will only get the same data? - benchmarking

When using a CPU simulator such as sniper or gem5 to run a benchmark program such as SPEC2006 in conjunction with the MCPAT power consumption model, whether a benchmark program only gets one set of power data .
For example, if I run the FFT benchmark, I will only get the same set of power data no matter how many times I run it.

I use McPat after collecting the stats of my simulations and it gives me different power data based benchmark and number of instructions simulated. You could try writing a script to convert m5 stats file to xml and use McPat to calculate the average power.

Related

Simulink Clock Synchronisation

In simulink if I run any simulation, it follows an internal clock. I want to run these simulations in real time.
Example: if I use a PWM pulse generator and give it a sample time of 1 second, I expect that it will generate a sample at the end of every one second real-time but the simulink clock moves very very fast (every one second real time corresponds to about 1e6 seconds smulink time). Is there any way to synchronize the simulink clock with the real time clock?
I actually need to give input to hardware at the end of every 2 seconds in a loop and that is why this kind of synchronization is needed.
Firstly note that Simulink is not a real-time environment, so anything you do related to this is not guaranteed to be anything but approximate in the timing achieved.
If your model runs faster than real-time then it can be paused at each time step until clock-time and simulation time are (approximately) equal. This is achieved by writing an S-Function.
There are several examples of doing this. For instance here or here.

Scale Down the CPU frequency

Is there any way to tell the kernel that I don't need the full CPU power?
Basically, I want to do some calculation while waiting for another process. But I don't need the full CPU power for that. As the CPU load during the computation is still 100%, the frequency is high. I want to tell the kernel that I am satisfied with a lower CPU frequency in order to save energy.
Instead of calculating using the full frequency and then suspend to wait for the other process, I want to try calculating with lower frequency so that the CPU is not in a lower C state when the other process has finished and the frequency can scale back again.
This doesn't make any sense on a multi-process system, particularly not in Linux. The CPU frequency is a very basic parameter, which affects everything running on the computer - including other processes and the OS itself.
If your program would fiddle with the CPU frequency, it would not only dictate the priority of itself, but also the priority of everything in the computer, including the OS. This isn't possible to do on a desktop system, simply because it doesn't make any sense to have a single application process dictate things that not even the OS dares to meddle with.
If saving power is a priority, you should probably look for completely different alternatives than some desktop Linux solution. PC computers only care about 1) speed, 2) speed and also 3) speed.
The kind of things you ask for are common in real-time embedded systems, where CPUs have a "sleep mode", from which it can wake up to execute something, then go back to sleep. It would usually also be possible for such systems to fiddle with the internal PLL to adjust their own frequency, but such solutions are rare. The industry standard way of doing things is to perform all calculations at max speed, then revert to power-saving sleep mode.
In case of multiple cores - there is a way to specify a certain cpu core to an interrupt. In this way you can save a certain CPU to a certain process:
to find the irq number of the task use:
cat /proc/interrupts
look for your irq number.
lets say that the irp number is 99, so in order to set core #2 to handle this irq, do:
echo 2 > /proc/irq/99/smp_affinity
in this way you can save a certain core to handle your special process.
You can actually use nice() to tell the kernel that your process can live with a lower-than-normal scheduling priority. This effectively reduces the amount of time slices your process will get to use the CPU (typically in favor of other processes running at the same time).
On some more modern systems, if this will reduce the overall CPU load significantly, the CPU might eventually even decide to run on lower frequency. But you typically don't have direct influence on that decision.
Note: Depending on the system, you might be having problems to restore the original nice value (i.e. to scale up on priority again) without running with appropriate permissions.
In case your application is I/O bound and is not doing stupid things wasting CPU cycles like busy-waiting, it shouldn't be necessary to revert to reducing your nice value - Modern CPUs and operating systems should be able to detect themselves when the system is mainly idling around and step down autonomously.
modify scalling_goverance accordingly.
The "scaling_governor" feature enables setting a static frequency to the CPU.
Frequency value must be between scaling_min_freq and scaling_max_freq.
When CPU frequency governor is set to "powersave" mode, CPU is set to the lowest static frequency (within the borders of scaling_min_freq and scaling_max_freq).
Check in below path on the target
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_available_governors
and select the required scalling governanceby writing to
echo "powersaving"/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
To tune the performance, few of the files can be updated which will make changes in CPU and the frequency and also the scheduler policies.
Based on the performance analysis, and the load balancing.
Modifications can be adopted.
Check /sys/devices/system/cpu/
Example:-
root:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_governo
rs
interactive performance
root#:/sys/devices/system/cpu/cpu0/cpufreq#
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_governor
interactive
root#:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600
##:/sys/devices/system/cpu/cpu0/cpufreq# cat scaling_available_frequen
cies
400000 800000 998400 1094400 1190400 1248000 1305600

Regarding CPU utilization

Considering the below piece of C code, I expected the CPU utilization to go up to 100% as the processor would try to complete the job (endless in this case) given to it. On running the executable for 5 mins, I found the CPU to go up to a max. of 48%. I am running Mac OS X 10.5.8; processor: Intel Core 2 Duo; Compiler: GCC 4.1.
int i = 10;
while(1) {
i = i * 5;
}
Could someone please explain why the CPU usage does not go up to 100%? Does the OS limit the CPU from reaching 100%?
Please note that if I added a "printf()" inside the loop the CPU hits 88%. I understand that in this case, the processor also has to write to the standard output stream hence the sharp rise in usage.
Has this got something to do with the amount of job assigned to the processor per unit time?
Regards,
Ven.
You have a multicore processor and you are in a single thread scenario, so you will use only one core full throttle ... Why do you expect the overall processor use go to 100% in a similar context ?
Run two copies of your program at the same time. These will use both cores of your "Core 2 Duo" CPU and overall CPU usage will go to 100%
Edit
if I added a "printf()" inside the loop the CPU hits 88%.
The printf send some characters to the terminal/screen. Sending information, Display and Update is handeled by code outside your exe, this is likely to be executed on another thread. But displaying a few characters does not need 100% of such a thread. That is why you see 100% for Core 1 and 76% for Core 2 which results in the overal CPU usage of 88% what you see.

average time between kernel launch and execution?

If I understand correctly, when you launch a CUDA kernel asynchronously, it may begin execution immediately or it may wait for previous asynchronous calls (transfers, kernels, etc) to complete first. (I also understand that kernels can run concurrently in some cases, but I want to ignore that for now).
How can I find out the time between launching a kernel ("queuing") and when it actually begins execution. In fact, I really just want to know the average "queued time" for all launches in a single run of my program (generally in the tens or hundreds of thousands of kernel launches.)
I can easily calculate the average execution time per kernel with events (~500us). I tried to simulate - I dropped the results of CLOCK() every time a kernel is launched, with the idea that I could then determine how long the launch queue was when each kernel was launched. But CLOCK() does not have high enough precision (0.01s) - sometimes as many as 60 kernels appear to be launched at a single time, when of course in reality many are not.
Rather than clock use the QueryPerformanceTimer which counts based on machine clock cycles.
Code for QueryPerformanceTimer
Secondly, the profiling tool (Visual Profiler) only measures serial launches [see page 24] and [see post number 3].
Thus the best option is (1) use QueryPerformanceTimer (or the Visual Profiler) such that you get an accurate measurement of a single launch and (2) use QueryPerformanceTimer to get the timing of multiple launches and observe whether the timing results suggest that asynchronous launching took place.

How to measure the power consumed by a C algorithm while running on a Pentium 4 processor?

How can I measure the power consumed by a C algorithm while running on a Pentium 4 processor (and any other processor will also do)?
Since you know the execution time, you can calculate the energy used by the CPU by looking up the power consumption on the P4 datasheet. For example, a 2.2 GHz P4 with a 400 MHz FSB has a typical Vcc of 1.3725 Volts and Icc of 47.9 Amps which is (1.3725*47.9=) 65.74 watts. Since you know your loop of 10,000 algorithm cycles took 46.428570s, you assume a single loop will take 46.428570/10000 = 0.00454278s. The amount of energy consumed by your algorithm would then be 65.74 watts * 0.00454278s = 0.305 watt seconds (or joules).
To convert to kilowatt hours: 0.305 watt seconds * 1000 kilowatts/watt * 1 hour / 3600 seconds = 0.85 kwh. A utility company charges around $0.11 per kwh so this algorithm would cost 0.85 kwh * $0.11 = about a penny to run.
Keep in mind this is the CPU only...none of the rest of the computer.
Run your algorithm in a long loop with a Kill-a-Watt attached to the machine?
Excellent question; I upvoted it. I haven't got a clue, but here's a methodology:
-- get CPU spec sheet from Intel (or AMD or whoever) or see Wikipedia; that should tell you power consumption at max FLOP rate;
-- translate algorithm into FLOPs;
-- do some simple arithmetic;
-- post your data and calculations to SO and invite comments and further data
Of course, you'll have to frame your next post as another question, I'll watch with interest.
Unless you run the code on a simple single tasking OS such as DOS or and RTOS where you get precise control of what runs at any time, the OS will typically be running many other processes simultaneously. It may be difficult to distinguish between your process and any others.
First, you need to be running the simplest OS that supports your code (probably a server version unix of some sort, I expect this to be impractical on Windows). That's to avoid the OS messing up your measurements.
Then you need to instrument the box with a sensitive datalogger between the power supply and motherboard. This is going to need some careful hardware engineering so as not to mess up the PCs voltage regulation, but someone must have done it.
I have actually done this with an embedded MIPS box and a logging multimeter, but that had a single 12V power supply. Actually, come to think of it, if you used a power supply built for running a PC in a vehicle, you would have a 12V supply and all you'd need then is a lab PSU with enough amps to run the thing.
It's hard to say.
I would suggest you to use a Current Clamp, so you can measure all the power being consumed by your CPU. Then you should measure the idle consumption of your system, get the standard value with as low a standard deviation as possible.
Then run the critical code in a loop.
Previous suggestions about running your code under DOS/RTOS are also valid, but maybe it will not compile the same way as your production...
Sorry, I find this question senseless.
Why ? Because an algorithm itself has (with the following exceptions*) no correlation with the power consumption, it is the priority on the program/thread/process runs. If you change the priority, you change the amount of idle time the processor has and therefore the power consumption. I think the only difference in energy consumption between the instructions is the number of cycles needed, so fast code will be power friendly.
To measure power consumption of a "algorithm" is senseless if you don't mean the performance.
*Exceptions: Threads which can be idle while waiting for other threads, programs which use the HLT instruction.
Sure running the processor at fast as possible increases the amount of energy superlinearly
(more heat, more cooling needed), but that is a hardware problem. If you want to spare energy, you can downclock the processor or use energy-efficient ones (Atom processor), but changing/tweaking the code won't change anything.
So I think it makes much more sense to ask the processor producer for specifications what different processor modes exist and what energy consumption they have. You also need to know that the periphery (fan, power supply, graphics card (!)) and the running software on the system will influence the results of measuring computer power.
Why do you need this task anyway ?

Resources