Throughput calculation using cycle count

Throughput calculation using cycle count - c

Is it possible to determine the throughput of an application on a processor from the cycle counts (Processor instruction cycles) consumed by the application ? If yes, how to calculate it ?

If the process is entirely CPU bound, then you divide the processor speed by the number of cycles to get the throughput.
In reality, few processes are entirely CPU bound though, in which case you have to take other factors (disk speed, memory speed, serialization, etc.) into account.

Simple:
#include <time.h>
clock_t c;
c = clock(); // c holds clock ticks value
c = c / CLOCKS_PER_SEC; // real time, if you need it
Note that the value you get is an approximation, for more info see the clock() man page.

Some CPUs have internal performance registers which enable you to collect all sorts of interesting statistics, such as instruction cycles (sometimes even on a per execution unit basis), cache misses, # of cache/memory reads/writes, etc. You can access these directly, but depending on what CPU and OS you are using there may well be existing tools which manage all the details for you via a GUI. Often a good profiling tool will have support for performance registers and allow you to collect statistics using them.

If you use the Cortex-M3 from TI/Luminary Micro, you can make use of the driverlib delivered by TI/Luminary Micro.
Using the SysTick functions you can set the SysTickPeriod to 1 processor cycle: So you have 1 processor clock between interrupts. By counting the number of interrupts you should get a "near enough estimation" on how much time a function or function block take.

Related

Embedded Programming, Wait for 12.5 us

I'm programming on the C2000 F28069 Experimenters Kit. I'm toggling a GPIO output every 12.5 microseconds 5 times in a row. I decided I don't want to use interrupts (though I will if I absolutely have to). I want to just wait that amount of times in terms of clock cycles.
My clock is running at 80MHz, so 12.5 us should be 1000 clock cycles. When I use a loop:
for(i=0;i<1000;i++)
I get a result that is way too long (not 12.5 us). What other techniques can I use?
Is sleep(n); something that I can use on a microcontroller? If so, which header file do I need to download and where can I find it? Also, now that I think about it, sleep(n); takes an int input, so that wouldn't even work... any other ideas?

Summary: Use the PWM or Timer peripherals to generate output pulses.
First, the clock speed of the CPU has a complex relationship to actual code execution speed, and in many CPUs there is more than one clock rate involved in different stages of the execution. The chip you reference has several internal clock sources, for instance. Further, each individual instruction will likely take a different number of clocks to execute, and some cores can execute part of (or all of) several instructions simultaneously.
To rigorously create a loop that required 12.5 µs to execute without using a timing interrupt or other hardware device would require careful hand coding in assembly language along with careful accounting of the execution time of each instruction.
But you are writing in C, not assembler.
So the first question you have to ask is what machine code was actually generated for your loop. And the second question is did you enable the optimizer, and to what level.
As written, a decent optimizer will determine that the loop for (i=0; i<1000; i++) ; has no visible side effects, and therefore is just a slow way of writing ;, and can be completely removed.
If it does compile the loop, it could be written naively using perhaps as many as 5 instructions, or as few as one or two. I am not personally familiar with this particular TI CPU architecture, so I won't attempt to guess at the best possible implementation.
All that said, learning about the CPU architecture and its efficiency is important to building reliable and efficient embedded systems. But given that the chip has peripheral devices built-in that provide hardware support for PWM (pulse width modulated) outputs as well as general purpose hardware timer/counters you would be far better off learning to use the hardware to generate the waveform for you.
I would start by collecting every document available on the CPU core and its peripherals, especially app notes and sample code.
The C compiler will have an option to emit and preserve an assembly language source file. I would use that as a guide to study the structure of the code generated for critical loops and other bottlenecks, as well as the effects of the compiler's various optimization levels.
The tool suite should have a mechanism for profiling your running code. Before embarking on heroic measures in pursuit of optimizations, use that first to identify the actual bottlenecks. Even if it lacks decent profiling, you are likely to have spare GPIO pins that can be toggled around critical sections of code and measured with a logic analyzer or oscilloscope.

The chip you refer has PWM (pulse width modulation) hardware declared as one of major winning features. You should rely on this. Please refer to appropriate application guide. Generally you cannot guarantee 12.5uS periods from application layer (and should not try to do so). Even if you managed to do so directly from application layer it's bad idea. Any change in your firmware code can break this.

If you use a timer peripheral with PWM output capability as suggested by #RBerteig already, then you can generate an accurate timing signal with zero software overhead. If you need to do other work synchronously with the clock, then you can use the timer interrupt to trigger that too. However if you process interrupts at an interval of 12.5us you may find that your processor spends a great deal of time context switching rather than performing useful work.
If you simply want an accurate delay, then you should still use a hardware timer and poll its reload flag rather than process its interrupt. This allows consistent timing independent of the compiler's code generation or processor speed and allows you to add other code within the loop without extending the total loop time. You would poll it in a loop during which you might do other work as well. The timing jitter and determinism will depend on what other work you do in the loop, but for an empty loop, reaction to the timer even will probably be faster than the latency on an interrupt handler.

Measure CPU frequency (x86 / x64)

I'm looking for some kind of a library that gives me accurate CPU frequency values periodically on both Intel and AMD processors, on 32-bit and 64-bit Windows.
The purpose of this is to accuratly measure CPU load on a given computer. The problem is that calling QueryPerformanceCounter() returns clock ticks (used to measure the duration of an activity) but the underlying CPU frequency is not constant because of SpeedStep or TurboBoost. I've found several computers where turning off SpeedStep / TurboBoost in the BIOS and doesn't prevent CPU frequency scaling based on load.
I'm trying to see if there are any libraries available that could be used to detect CPU frequency changes (much like how Throttlestop / CPU-Z or even the Overview tab of Resource Monitor in Windows 7) so that I could query and save this information along with my other measurements. Performance counters don't seem to return reliable information, as I have computers that always return 100% CPU frequency, even when other tools show dynamic frequency changes.
I searched for such libraries but most results come back with gadgets, etc., that are not useful.

You can combine a high-resolution timer with a clock cycle counter to compute the current clock rate. On modern CPUs, the cycle counter can be read with this function:
static inline uint64_t get_cycles()
{
uint64_t t;
asm volatile ("rdtsc" : "=A"(t));
return t;
}
Note that this is per CPU, so if your program gets moved around CPUs, you're in trouble. If you know about CPU pinning techniques on your platform, you might like to try those.
For high resolution time measurement, you can use the tools in <chrono>; here's a semi-useful post of mine on the topic.

Try to focus on what you are trying to do, and not on how to do it.
What is your ultimate goal?
If, as you say, you are trying to "measure CPU load on a given computer", on Windows it may be a good practice using "PdhOpenQuery" and the "Pdh*" family functions.
See this SO answer as well:
How to determine CPU and memory consumption from inside a process?

Consider looking at the __rdtsc intrinsic function (#include "intrin.h" in Visual Studio).
This yields the clock count directly from the processor via the x86/x64 function RDTSC (Read Timestamp).

ARM926EJ-S cycle-counter

Im using an ARM926EJ-S and am trying to figure out whether the ARM can give (e.g. a readable register) the CPU's cycle-counter. I guess a # that will represent the number of cycles since the CPU has been powered.
In my system i have only Low-Res external RTC/Timers. I would like to be able to achieve a Hi-Res timer.
Many thanks in advance!

You probably have only two choices:
Use an instruction-cycle accurate simulator; the problem here is that effectively simulating peripherals and external stimulus can be complex or impossible.
Use a peripheral hardware timer. In most cases you will not be able to run such a timer at the typical core clock rate of an ARM9, and there will be an over head in servicing the timer either side of the period being timed, but it can be used to give execution time over larger or longer running sections of code, which may be of more practical use than cycle count.
While cycle count may be somewhat scalable to different clock rates, it remains constrained by memory and I/O wait states, so is perhaps not as useful as it may seem as a performance metric, except at the micro-level of analysis, and larger performance gains are typically to be had by taking a wider view.

The arm-9 is not equipped with an PMU (Performance Monitoring Unit) as included in the Cortex-family. The PMU is described here. The linux kernel comes equipped with support for using the PMU for benchmarking performance. See here for documentation of the perf tool-set.
Bit unsure about the arm-9, need to dig a bit more...

How to measure the power consumed by a C algorithm while running on a Pentium 4 processor?

How can I measure the power consumed by a C algorithm while running on a Pentium 4 processor (and any other processor will also do)?

Since you know the execution time, you can calculate the energy used by the CPU by looking up the power consumption on the P4 datasheet. For example, a 2.2 GHz P4 with a 400 MHz FSB has a typical Vcc of 1.3725 Volts and Icc of 47.9 Amps which is (1.3725*47.9=) 65.74 watts. Since you know your loop of 10,000 algorithm cycles took 46.428570s, you assume a single loop will take 46.428570/10000 = 0.00454278s. The amount of energy consumed by your algorithm would then be 65.74 watts * 0.00454278s = 0.305 watt seconds (or joules).
To convert to kilowatt hours: 0.305 watt seconds * 1000 kilowatts/watt * 1 hour / 3600 seconds = 0.85 kwh. A utility company charges around $0.11 per kwh so this algorithm would cost 0.85 kwh * $0.11 = about a penny to run.
Keep in mind this is the CPU only...none of the rest of the computer.

Run your algorithm in a long loop with a Kill-a-Watt attached to the machine?

Excellent question; I upvoted it. I haven't got a clue, but here's a methodology:
-- get CPU spec sheet from Intel (or AMD or whoever) or see Wikipedia; that should tell you power consumption at max FLOP rate;
-- translate algorithm into FLOPs;
-- do some simple arithmetic;
-- post your data and calculations to SO and invite comments and further data
Of course, you'll have to frame your next post as another question, I'll watch with interest.

Unless you run the code on a simple single tasking OS such as DOS or and RTOS where you get precise control of what runs at any time, the OS will typically be running many other processes simultaneously. It may be difficult to distinguish between your process and any others.

First, you need to be running the simplest OS that supports your code (probably a server version unix of some sort, I expect this to be impractical on Windows). That's to avoid the OS messing up your measurements.
Then you need to instrument the box with a sensitive datalogger between the power supply and motherboard. This is going to need some careful hardware engineering so as not to mess up the PCs voltage regulation, but someone must have done it.
I have actually done this with an embedded MIPS box and a logging multimeter, but that had a single 12V power supply. Actually, come to think of it, if you used a power supply built for running a PC in a vehicle, you would have a 12V supply and all you'd need then is a lab PSU with enough amps to run the thing.

It's hard to say.
I would suggest you to use a Current Clamp, so you can measure all the power being consumed by your CPU. Then you should measure the idle consumption of your system, get the standard value with as low a standard deviation as possible.
Then run the critical code in a loop.
Previous suggestions about running your code under DOS/RTOS are also valid, but maybe it will not compile the same way as your production...

Sorry, I find this question senseless.
Why ? Because an algorithm itself has (with the following exceptions*) no correlation with the power consumption, it is the priority on the program/thread/process runs. If you change the priority, you change the amount of idle time the processor has and therefore the power consumption. I think the only difference in energy consumption between the instructions is the number of cycles needed, so fast code will be power friendly.
To measure power consumption of a "algorithm" is senseless if you don't mean the performance.
*Exceptions: Threads which can be idle while waiting for other threads, programs which use the HLT instruction.
Sure running the processor at fast as possible increases the amount of energy superlinearly
(more heat, more cooling needed), but that is a hardware problem. If you want to spare energy, you can downclock the processor or use energy-efficient ones (Atom processor), but changing/tweaking the code won't change anything.
So I think it makes much more sense to ask the processor producer for specifications what different processor modes exist and what energy consumption they have. You also need to know that the periphery (fan, power supply, graphics card (!)) and the running software on the system will influence the results of measuring computer power.
Why do you need this task anyway ?

Assembly CPU frequency measuring algorithm

What are the common algorithms being used to measure the processor frequency?

Intel CPUs after Core Duo support two Model-Specific registers called IA32_MPERF and IA32_APERF.
MPERF counts at the maximum frequency the CPU supports, while APERF counts at the actual current frequency.
The actual frequency is given by:
You can read them with this flow
; read MPERF
mov ecx, 0xe7
rdmsr
mov mperf_var_lo, eax
mov mperf_var_hi, edx
; read APERF
mov ecx, 0xe8
rdmsr
mov aperf_var_lo, eax
mov aperf_var_hi, edx
but note that rdmsr is a privileged instruction and can run only in ring 0.
I don't know if the OS provides an interface to read these, though their main usage is for power management, so it might not provide such an interface.

I'm gonna date myself with various details in this answer, but what the heck...
I had to tackle this problem years ago on Windows-based PCs, so I was dealing with Intel x86 series processors like 486, Pentium and so on. The standard algorithm in that situation was to do a long series of DIVide instructions, because those are typically the most CPU-bound single instructions in the Intel set. So memory prefetch and other architectural issues do not materially affect the instruction execution time -- the prefetch queue is always full and the instruction itself does not touch any other memory.
You would time it using the highest resolution clock you could get access to in the environment you are running in. (In my case I was running near boot time on a PC compatible, so I was directly programming the timer chips on the motherboard. Not recommended in a real OS, usually there's some appropriate API to call these days).
The main problem you have to deal with is different CPU types. At that time there was Intel, AMD and some smaller vendors like Cyrix making x86 processors. Each model had its own performance characteristics vis-a-vis that DIV instruction. My assembly timing function would just return a number of clock cycles taken by a certain fixed number of DIV instructions done in a tight loop.
So what I did was to gather some timings (raw return values from that function) from actual PCs running each processor model I wanted to time, and record those in a spreadsheet against the known processor speed and processor type. I actually had a command-line tool that was just a thin shell around my timing function, and I would take a disk into computer stores and get the timings off of display models! (I worked for a very small company at the time).
Using those raw timings, I could plot a theoretical graph of what timings I should get for any known speed of that particular CPU.
Here was the trick: I always hated when you would run a utility and it would announce that your CPU was 99.8 Mhz or whatever. Clearly it was 100 Mhz and there was just a small round-off error in the measurement. In my spreadsheet I recorded the actual speeds that were sold by each processor vendor. Then I would use the plot of actual timings to estimate projected timings for any known speed. But I would build a table of points along the line where the timings should round to the next speed.
In other words, if 100 ticks to do all that repeating dividing meant 500 Mhz, and 200 ticks meant 250 Mhz, then I would build a table that said that anything below 150 was 500 Mhz, and anything above that was 250 Mhz. (Assuming those were the only two speeds available from that chip vendor). It was nice because even if some odd piece of software on the PC was throwing off my timings, the end result would often still be dead on.
Of course now, in these days of overclocking, dynamic clock speeds for power management, and other such trickery, such a scheme would be much less practical. At the very least you'd need to do something to make sure the CPU was in its highest dynamically chosen speed first before running your timing function.
OK, I'll go back to shooing kids off my lawn now.

One way on x86 Intel CPU's since Pentium would be to use two samplings of the RDTSC instruction with a delay loop of known wall time, eg:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
uint64_t rdtsc(void) {
uint64_t result;
__asm__ __volatile__ ("rdtsc" : "=A" (result));
return result;
}
int main(void) {
uint64_t ts0, ts1;
ts0 = rdtsc();
sleep(1);
ts1 = rdtsc();
printf("clock frequency = %llu\n", ts1 - ts0);
return 0;
}
(on 32-bit platforms with GCC)
RDTSC is available in ring 3 if the TSC flag in CR4 is set, which is common but not guaranteed. One shortcoming of this method is that it is vulnerable to frequency scaling changes affecting the result if they happen inside the delay. To mitigate that you could execute code that keeps the CPU busy and constantly poll the system time to see if your delay period has expired, to keep the CPU in the highest frequency state available.

I use the following (pseudo)algorithm:
basetime=time(); /* time returns seconds */
while (time()==basetime);
stclk=rdtsc(); /* rdtsc is an assembly instruction */
basetime=time();
while (time()==basetime
endclk=rdtsc();
nclks=encdclk-stclk;
At this point you might assume that you've determined the clock frequency but even though it appears correct it can be improved.
All PCs contain a PIT (Programmable Interval Timer) device which contains counters which are (used to be) used for serial ports and the system clock. It was fed with a frequency of 1193182 Hz. The system clock counter was set to the highest countdown value (65536) resulting in a system clock tick frequency of 1193182/65536 => 18.2065 Hz or once every 54.925 milliseconds.
The number of ticks necessary for the clock to increment to the next second will therefore depend. Usually 18 ticks are required and sometimes 19. This can be handled by performing the algorithm (above) twice and storing the results. The two results will either be equivalent to two 18 tick sequences or one 18 and one 19. Two 19s in a row won't occur. So by taking the smaller of the two results you will have an 18 tick second. Adjust this result by multiplying with 18.2065 and dividing by 18.0 or, using integer arithmetic, multiply by 182065, add 90000 and divide by 180000. 90000 is one half of 180000 and is there for rounding. If you choose the calculation with integer route make sure you are using 64-bit multiplication and division.
You will now have a CPU clock speed x in Hz which can be converted to kHz ((x+500)/1000) or MHz ((x+5000000)/1000000). The 500 and 500000 are one half of 1000 and 1000000 respectively and are there for rounding. To calculate MHz do not go via the kHz value because rounding issues may arise. Use the Hz value and the second algorithm.

That was the intention of things like BogoMIPS, but CPUs are a lot more complicated nowadays. Superscalar CPUs can issue multiple instructions per clock, making any measurement based on counting clock cycles to execute a block of instructions highly inaccurate.
CPU frequencies are also variable based on offered load and/or temperature. The fact that the CPU is currently running at 800 MHz does not mean it will always be running at 800 MHz, it might throttle up or down as needed.
If you really need to know the clock frequency, it should be passed in as a parameter. An EEPROM on the board would supply the base frequency, and if the clock can vary you'd need to be able to read the CPUs power state registers (or make an OS call) to find out the frequency at that instant.
With all that said, there may be other ways to accomplish what you're trying to do. For example if you want to make high-precision measurements of how long a particular codepath takes, the CPU likely has performance counters running at a fixed frequency which are a better measure of wall-clock time than reading a tick count register.

"lmbench" provides a cpu frequency algorithm portable for different architecture.
It runs some different loops and the processor's clock speed is the greatest common divisor of the execution frequencies of the various loops.
this method should always work when we are able to get loops with cycle counts that are relatively prime.
http://www.bitmover.com/lmbench/

One option is to sense the CPU frequency, by running code with known instructions per loop
This functionality is contained in 7zip, since about v9.20 I think.
> 7z b
7-Zip 9.38 beta Copyright (c) 1999-2014 Igor Pavlov 2015-01-03
CPU Freq: 4266 4000 4266 4000 2723 4129 3261 3644 3362
The final number is meant to be correct (and on my PC and many others, I have found it to be quite correct - the test runs very quick so turbo may not kick in, and servers set in Balanced/Power Save modes most likely give readings of around 1ghz)
The source code is at GitHub (Official source is a download from 7-zip.org)
With the most significant portion being:
#define YY1 sum += val; sum ^= val;
#define YY3 YY1 YY1 YY1 YY1
#define YY5 YY3 YY3 YY3 YY3
#define YY7 YY5 YY5 YY5 YY5
static const UInt32 kNumFreqCommands = 128;
EXTERN_C_BEGIN
static UInt32 CountCpuFreq(UInt32 sum, UInt32 num, UInt32 val)
{
for (UInt32 i = 0; i < num; i++)
{
YY7
}
return sum;
}
EXTERN_C_END

On Intel CPUs, a common method to get the current (average) CPU frequency is to calculate it from a few CPU counters:
CPU_freq = tsc_freq * (aperf_t1 - aperf_t0) / (mperf_t1 - mperf_t0)
The TSC (Time Stamp Counter) can be read from userspace with dedicated x86 instructions, but its frequency has to be determined by calibration against a clock. The best approach is to get the TSC frequency from the kernel (which already has done the calibration).
The aperf and mperf counters are model specific registers MSRs that require root privileges for access. Again, there are dedicated x86 instructions for accessing the MSRs.
Since the mperf counter rate is directly proportional to the TSC rate and the aperf rate is directly proportional to the CPU frequency you get the CPU frequency with the above equation.
Of course, if the CPU frequency changes in your t0 - t1 time delta (e.g. due due frequency scaling) you get the average CPU frequency with this method.
I wrote a small utility cpufreq which can be used to test this method.
See also:
[PATCH] x86: Calculate MHz using APERF/MPERF for cpuinfo and scaling_cur_freq. 2016-04-01, LKML
Frequency-invariant utilization tracking for x86. 2020-04-02, LWN.net

I'm not sure why you need assembly for this. If you're on a machine that has the /proc filesystem, then running:
> cat /proc/cpuinfo
might give you what you need.

A quick google on AMD and Intel shows that CPUID should give you access to the CPU`s max frequency.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight