When benchmarking, what would make the elapsed real time LESS than the user CPU time?

Following up on a question I posed earlier:
I ended up with a User CPU time and Total CPU time that was about 4% longer in duration than the elapsed real time. Based on the accepted answer to my earlier question, I don't understand how this could be the case. Could anyone explain this?

Multithreaded code on multiple cores can use more than 100% CPU time.

Because if I use two CPUs at 100% for 10 minutes, I've used 20 minutes worth of CPU time (i.e. were one of those CPUs disabled, it would take 20 minutes for my operation to complete)

One possible reason for benchmarks being off by a small margin is insufficient timer resolution.
There are quite a few ways of determining those values (time, ticks, CPU frequency, OS API, etc) so not all benchmark routines are 100% reliable.

Related

Real time and cpu time measurement difference - firstly, using clock() and gtod(), secondly using time command on console? [duplicate]

I can take a guess based on the names, but what specifically are wall-clock-time, user-cpu-time, and system-cpu-time in Unix?
Is user-cpu time the amount of time spent executing user-code while kernel-cpu time the amount of time spent in the kernel due to the need of privileged operations (like I/O to disk)?
What unit of time is this measurement in?
And is wall-clock time really the number of seconds the process has spent on the CPU or is the name just misleading?
Wall-clock time is the time that a clock on the wall (or a stopwatch in hand) would measure as having elapsed between the start of the process and 'now'.
The user-cpu time and system-cpu time are pretty much as you said - the amount of time spent in user code and the amount of time spent in kernel code.
The units are seconds (and subseconds, which might be microseconds or nanoseconds).
The wall-clock time is not the number of seconds that the process has spent on the CPU; it is the elapsed time, including time spent waiting for its turn on the CPU (while other processes get to run).
Wall clock time: time elapsed according to the computer's internal clock, which should match time in the outside world. This has nothing to do with CPU usage; it's given for reference.
User CPU time and system time: exactly what you think. System calls, which include I/O calls such as read, write, etc. are executed by jumping into kernel code and executing that.
If wall clock time < CPU time, then you're executing a program in parallel. If wall clock time > CPU time, you're waiting for disk, network or other devices.
All are measured in seconds, per the SI.
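The three quantities described above can be read from inside a C program. The following is a minimal sketch, assuming a POSIX system: `clock_gettime(CLOCK_MONOTONIC, ...)` supplies the wall-clock time and `getrusage(RUSAGE_SELF, ...)` supplies the user and system CPU times. The names `struct timing`, `measure` and `spin` are hypothetical helpers introduced only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>
#include <sys/time.h>
#include <sys/resource.h>

/* Wall-clock, user-CPU and system-CPU time, all in seconds. */
struct timing { double wall, user, sys; };

static double timeval_seconds(struct timeval tv) {
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical helper: run work() once and report the time it consumed. */
struct timing measure(void (*work)(void)) {
    struct timespec t0, t1;
    struct rusage r0, r1;
    struct timing t;

    clock_gettime(CLOCK_MONOTONIC, &t0);   /* wall clock before */
    getrusage(RUSAGE_SELF, &r0);           /* CPU usage before  */
    work();
    getrusage(RUSAGE_SELF, &r1);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    t.wall = (double)(t1.tv_sec - t0.tv_sec)
           + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    t.user = timeval_seconds(r1.ru_utime) - timeval_seconds(r0.ru_utime);
    t.sys  = timeval_seconds(r1.ru_stime) - timeval_seconds(r0.ru_stime);
    return t;
}

/* Example workload: pure user-mode arithmetic, no system calls. */
void spin(void) {
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 50000000UL; i++)
        x += i;
}
```

Calling `measure(spin)` from a single-threaded program should give a user time close to the wall time and a system time near zero. With several threads running on several cores, `user + sys` can exceed `wall`.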
time [WHAT-EVER-COMMAND]
real 7m2.444s
user 76m14.607s
sys 2m29.432s
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
real or wall-clock
real 7m2.444s
On a system with a 24-core processor, this command took more than 7 minutes to complete, using as much parallelism as the available cores allowed.
user
user 76m14.607s
The command used this much CPU time in total, summed across all cores.
In other words, on a machine with a single-core CPU, real and user would be nearly equal, so the same command would take approximately 76 minutes to complete.
sys
sys 2m29.432s
This is the time taken by the kernel to execute all the basic/system level operations to run this cmd, including context switching, resource allocation, etc.
Note: The example assumes that your command utilizes parallelism/threads.
Detailed man page: https://linux.die.net/man/1/time
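As a rough check on the figures in this answer, dividing the total CPU time (user + sys) by the real time gives the average number of cores that were busy. This is a small sketch of that arithmetic; `to_seconds` and `average_busy_cores` are names made up for the illustration.

```c
/* Convert the "XmY.YYYs" format printed by time into seconds. */
double to_seconds(double minutes, double seconds) {
    return minutes * 60.0 + seconds;
}

/* Average number of busy cores = total CPU time / wall-clock time. */
double average_busy_cores(double real_s, double user_s, double sys_s) {
    return (user_s + sys_s) / real_s;
}
```

With the output above, (4574.607 + 149.432) / 422.444 is about 11.18, so on average roughly 11 of the 24 CPUs were busy during the run.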
Wall clock time is exactly what it says, the time elapsed as measured by the clock on your wall (or wristwatch)
User CPU time is the time spent in "user land", that is, time the CPU spent executing the process's own code outside the kernel.
System CPU time is time spent in the kernel, usually time spent servicing system calls.

Calling a C function periodically on OSX

I have a function which calculates a BPM for a track from incoming data packets from a CDJ. Let's say the BPM was 124.45 beats per minute, how would I go about calling a function every 0.482 seconds (i.e. once per beat)? Would it be possible to set up another thread and set a timer?
Maybe have a look at high-precision timers, for which Apple claims 500-microsecond accuracy, which is 0.1% of your 500 (ish) millisecond requirement. You can minimise skew by reading the time at the start of your processing and calculating an offset to the next beat. Also, if you find you are often getting scheduled late and missing beats, you can sleep for, say, 95% of the time to your next beat so the CPU can schedule something else, and then busy-wait for only the last few percent, so you don't hog the CPU for the whole interval.
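The sleep-then-spin idea can be sketched in portable POSIX C. This is only an illustration: it uses `clock_gettime` and `nanosleep` rather than any Apple-specific timer API, and `now_seconds`, `wait_until` and `run_beats` are hypothetical names. Deadlines are computed from the start time rather than from the previous beat, so lateness on one beat does not accumulate.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Current monotonic time in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Sleep through about 95% of the wait, then spin until the deadline. */
void wait_until(double deadline) {
    double remaining = deadline - now_seconds();
    if (remaining > 0.0) {
        double nap = remaining * 0.95;
        struct timespec ts;
        ts.tv_sec = (time_t)nap;
        ts.tv_nsec = (long)((nap - (double)ts.tv_sec) * 1e9);
        nanosleep(&ts, NULL);              /* lets other work be scheduled */
    }
    while (now_seconds() < deadline) {
        /* busy-wait for the last few percent */
    }
}

/* Call on_beat() count times, one beat apart, at the given tempo. */
void run_beats(double bpm, int count, void (*on_beat)(void)) {
    double period = 60.0 / bpm;            /* 124.45 BPM gives about 0.482 s */
    double start = now_seconds();
    for (int n = 1; n <= count; n++) {
        wait_until(start + n * period);    /* absolute deadline, no drift */
        on_beat();
    }
}
```

Something like `run_beats(124.45, 16, my_callback)` would then invoke a callback of your own for sixteen beats; running it on a separate thread keeps the rest of the program responsive.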

Internal working of timer

I know QueryPerformanceCounter() can be used for timing functions. I want to know:
1-Can I increase the resolution of the timer by over-clocking the CPU (so it ticks faster)?
2-Basically, what makes some timers more precise than others (e.g., QueryPerformanceCounter() is more precise than GetTickCount())? If there is a single crystal oscillator on the motherboard, why are some timers slower than others?
QueryPerformanceCounter has very high resolution - normally less than one nanosecond. I don't see why you'd like to increase it. Overclocking will increase it, but it seems like a very weak reason for overclocking.
QueryPerformanceCounter is very accurate, but somewhat expensive and not very convenient.
a. It's expensive because it uses the expensive rdtsc instruction. Faster timers can just read an integer from memory. This integer needs to be updated, and we don't want to do it too often (1000 times a second is reasonable), so we get a very cheap timer, with low precision. That's basically GetTickCount.
b. It's inconvenient because it uses units which change between computers. Sometimes it will be nanoseconds, sometimes half-nano, or other values. It makes it harder to calculate with.
c. Another source of inconvenience is that it returns very large numbers, which may overflow when you try to do math with them, so you need to be careful.
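The unit and overflow issues in points b and c can both be handled by one conversion routine. The sketch below is plain C and does not call any Windows API; on Windows the tick count would come from QueryPerformanceCounter and the tick rate from QueryPerformanceFrequency. The function name `ticks_to_ns` is hypothetical, and the code assumes the tick rate is below roughly 1.8e10 per second so the intermediate product fits in 64 bits.

```c
#include <stdint.h>

/* Convert a tick count to nanoseconds without overflowing 64 bits.
   Multiplying ticks by 1e9 directly can overflow for large counts,
   so whole seconds and the leftover ticks are converted separately.
   Assumes ticks_per_second is nonzero and below about 1.8e10. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t ticks_per_second) {
    uint64_t whole_seconds = ticks / ticks_per_second;
    uint64_t leftover = ticks % ticks_per_second;
    return whole_seconds * UINT64_C(1000000000)
         + leftover * UINT64_C(1000000000) / ticks_per_second;
}
```

For example, 15 ticks at 10 ticks per second converts to 1,500,000,000 ns.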
The timing source for QPC is machine dependent. It is typically picked up from a frequency available somewhere in the chipset. Whether overclocking the cpu is going to affect it is highly dependent on your motherboard design. The simplest way is to just try it, use QueryPerformanceFrequency to see the effect.
GetTickCount is driven from an entirely different timer source, the signal that also generates the clock interrupt. It is not very precise, 1/64 of a second normally, but it is highly accurate. Your machine contacts a time server from time to time to recalibrate the clock and adjust the clock correction factor, which makes it accurate to about a second over an entire year. QPC is very precise, but not nearly as accurate. Use it only to time short intervals.
1 - Yes, Internally, one of the better timers is rdtsc, which does give you the clock value. Combining this with information from cpuid instruction, gives you time.
2 - The other timers rely upon various timing sources, such as the 8253 timer, for instance.
QPF is a wrapper added by Microsoft on top of what rdtsc provides. Read this article for more info:
http://www.strchr.com/performance_measurements_with_rdtsc

How to measure CPU cycles per instruction in a C program

I have a C program where I am starting to use some SIMD optimizations for the SPE (Cell processor), etc. I would like somehow to "time" how many cycles they need. One idea is to switch them on and off and measure the whole execution time, but this is slow. I can also add gettimeofday(&start,NULL) and similar statements before and after the execution, but I think they are only precise when dealing with durations longer than milliseconds.
I wonder if it is possible to measure efficiently the nanoseconds per instruction or just the CPU cycles or some other precise timing measure.
Depending on your CPU you may be able to get at performance registers within the CPU itself which track instruction clocks and many other useful things. Profilers and other performance utilities can do this, so it should also be possible from user code too. On Mac OS X I would use the Apple CHUD framework, but you didn't state what OS or CPU you are using so it's hard to give specific suggestions.
Execute the code to be tested in a loop and divide the time it takes by the loop count. With enough iterations, the timer you use does not need to be high resolution to measure correct values.
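A minimal sketch of this loop-and-divide approach, assuming a POSIX system with `clock_gettime`. The names `ns_per_call`, `sample_op` and `sink` are hypothetical. The result includes the overhead of the loop and of the function call, so it is an upper bound on the cost of the operation itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Average time of one call to fn(), in nanoseconds, over many iterations. */
double ns_per_call(void (*fn)(void), unsigned long iterations) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < iterations; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                    + (double)(t1.tv_nsec - t0.tv_nsec);
    return total_ns / (double)iterations;
}

/* Example operation; volatile keeps the compiler from removing it. */
volatile unsigned long sink;
void sample_op(void) {
    sink += 1;
}
```

Calling `ns_per_call(sample_op, 1000000UL)` spreads the timer's granularity over a million calls, which is why a coarse timer is sufficient.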
Nano seconds won't be enough for that. You need picoseconds.
I don't think that you can measure something like this reliably. You will have to look into specifications (I'm not sure if current CPUs have this information documented).
As a not C guy... my guess is you need to look at the assembly code, and go from there. The only problem is a single instruction could take 1 or 100000 cpu cycles, depending on the exact CPU you are on.

How to measure the power consumed by a C algorithm while running on a Pentium 4 processor?

How can I measure the power consumed by a C algorithm while running on a Pentium 4 processor (and any other processor will also do)?
Since you know the execution time, you can calculate the energy used by the CPU by looking up the power consumption on the P4 datasheet. For example, a 2.2 GHz P4 with a 400 MHz FSB has a typical Vcc of 1.3725 Volts and Icc of 47.9 Amps which is (1.3725*47.9=) 65.74 watts. Since you know your loop of 10,000 algorithm cycles took 46.428570s, you assume a single loop will take 46.428570/10000 = 0.004642857s. The amount of energy consumed by your algorithm would then be 65.74 watts * 0.004642857s = 0.305 watt seconds (or joules).
To convert to kilowatt hours: 0.305 watt seconds * 1 kilowatt / 1000 watts * 1 hour / 3600 seconds = about 8.5e-8 kwh. A utility company charges around $0.11 per kwh so this algorithm would cost 8.5e-8 kwh * $0.11 = about $0.000000009 per run, a tiny fraction of a penny.
Keep in mind this is the CPU only...none of the rest of the computer.
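The arithmetic in this answer can be checked with a few lines of C. This is a sketch of the calculation only, using the datasheet figures quoted above; the function names are hypothetical, and it assumes the CPU draws its typical power for the whole run.

```c
/* Power drawn by the CPU core, from datasheet voltage and current. */
double cpu_power_watts(double vcc_volts, double icc_amps) {
    return vcc_volts * icc_amps;
}

/* Energy in joules (watt seconds) used over a given run time. */
double energy_joules(double power_watts, double seconds) {
    return power_watts * seconds;
}

/* 1 kWh = 1000 W * 3600 s = 3.6e6 J. */
double joules_to_kwh(double joules) {
    return joules / 3.6e6;
}
```

With Vcc = 1.3725 V, Icc = 47.9 A and 46.428570 s / 10000 per iteration, this gives about 65.74 W, 0.305 J and 8.5e-8 kWh per iteration.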
Run your algorithm in a long loop with a Kill-a-Watt attached to the machine?
Excellent question; I upvoted it. I haven't got a clue, but here's a methodology:
-- get CPU spec sheet from Intel (or AMD or whoever) or see Wikipedia; that should tell you power consumption at max FLOP rate;
-- translate algorithm into FLOPs;
-- do some simple arithmetic;
-- post your data and calculations to SO and invite comments and further data
Of course, you'll have to frame your next post as another question, I'll watch with interest.
Unless you run the code on a simple single-tasking OS such as DOS or an RTOS where you get precise control of what runs at any time, the OS will typically be running many other processes simultaneously. It may be difficult to distinguish between your process and any others.
First, you need to be running the simplest OS that supports your code (probably a server version unix of some sort, I expect this to be impractical on Windows). That's to avoid the OS messing up your measurements.
Then you need to instrument the box with a sensitive datalogger between the power supply and motherboard. This is going to need some careful hardware engineering so as not to mess up the PC's voltage regulation, but someone must have done it.
I have actually done this with an embedded MIPS box and a logging multimeter, but that had a single 12V power supply. Actually, come to think of it, if you used a power supply built for running a PC in a vehicle, you would have a 12V supply and all you'd need then is a lab PSU with enough amps to run the thing.
It's hard to say.
I would suggest using a current clamp, so you can measure all the power being consumed by your CPU. Then you should measure the idle consumption of your system to get a baseline value with as low a standard deviation as possible.
Then run the critical code in a loop.
Previous suggestions about running your code under DOS/RTOS are also valid, but maybe it will not compile the same way as your production...
Sorry, I find this question senseless.
Why? Because an algorithm itself has (with the following exceptions*) no correlation with the power consumption; what matters is the priority at which the program/thread/process runs. If you change the priority, you change the amount of idle time the processor has and therefore the power consumption. I think the only difference in energy consumption between the instructions is the number of cycles needed, so fast code will be power friendly.
To measure power consumption of a "algorithm" is senseless if you don't mean the performance.
*Exceptions: Threads which can be idle while waiting for other threads, programs which use the HLT instruction.
Sure, running the processor as fast as possible increases the amount of energy superlinearly (more heat, more cooling needed), but that is a hardware problem. If you want to spare energy, you can downclock the processor or use energy-efficient ones (Atom processor), but changing/tweaking the code won't change anything.
So I think it makes much more sense to ask the processor producer for specifications what different processor modes exist and what energy consumption they have. You also need to know that the periphery (fan, power supply, graphics card (!)) and the running software on the system will influence the results of measuring computer power.
Why do you need this task anyway ?

Resources