I know that there are tools like top and ps for measuring CPU usage, but the way they measure the CPU usage is by measuring how much time the idle task was not running. So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied. However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
tools like top and ps for measuring CPU usage ... measure the CPU usage ... by measuring how much time the idle task was not running.
No, they don't measure idle; they just read what the kernel thinks about its CPU usage via /proc/stat (try the vmstat 1 tool too). Did you check that the system-wide user + system times are really accounted only via the idle task? I think the kernel just exports some scheduler statistics: the scheduler records user/system state on rescheduling, both on the system timer tick and on blocking system calls (probably in one of the callers of cpuacct_charge, such as update_curr - "Update the current task's runtime statistics").
/proc/stat example:
cat /proc/stat
cpu 2255 34 2290 22625563 6290 127 456
and decode it using http://www.linuxhowtos.org/System/procstat.htm
The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing different kinds of work. Time units are in USER_HZ or Jiffies (typically hundredths of a second).
The meanings of the columns are as follows, from left to right:
user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs
When we see jiffies, it means the scheduler was used to get the numbers, not an estimate based on the idle task (top doesn't even show that task, or tasks with pid 0).
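To make that concrete, here is a minimal sketch (assuming the Linux /proc/stat layout decoded above; error handling omitted) that samples the aggregate cpu line twice and computes the busy share the way top/vmstat do. Note that this is still scheduler accounting: a core that is stalled on a cache miss is counted as busy.
#include <stdio.h>
#include <unistd.h>

// Read the aggregate "cpu" line from /proc/stat: total and idle jiffies since boot.
static int read_cpu(unsigned long long *total, unsigned long long *idle)
{
    unsigned long long v[8] = {0};   // user nice system idle iowait irq softirq steal
    FILE *f = fopen("/proc/stat", "r");
    if (!f)
        return -1;
    fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu %llu",
           &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6], &v[7]);
    fclose(f);
    *idle = v[3] + v[4];             // idle + iowait
    *total = 0;
    for (int i = 0; i < 8; i++)
        *total += v[i];
    return 0;
}

int main(void)
{
    unsigned long long t1, i1, t2, i2;
    read_cpu(&t1, &i1);
    sleep(1);                        // sampling interval
    read_cpu(&t2, &i2);
    double busy = 100.0 * (double)((t2 - t1) - (i2 - i1)) / (double)(t2 - t1);
    printf("CPU busy: %.1f%%\n", busy);   // same notion of "busy" as top uses
    return 0;
}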
So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied.
And basically (when there is no SMT, such as HyperThreading on Intel CPUs), the CPU is considered occupied while your task has a pipeline stall due to a memory access (or due to taking the wrong path with out-of-order execution). The OS can't run another task, because a task switch is more expensive than waiting out this one stall.
The SMT case is different, because there is hardware which either switches between two logical tasks on a single core, or even (in fine-grained SMT) mixes their instructions (micro-operations) into a single stream executed on the shared hardware. There are usually SMT statistics counters to check the actual mixing.
However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
The performance monitoring unit (PMU) may have useful events for this. For example, perf stat reports some of them (here on Sandy Bridge):
$ perf stat /bin/sleep 10
Performance counter stats for '/bin/sleep 10':
0,563759 task-clock # 0,000 CPUs utilized
1 context-switches # 0,002 M/sec
0 CPU-migrations # 0,000 M/sec
175 page-faults # 0,310 M/sec
888 767 cycles # 1,577 GHz
568 603 stalled-cycles-frontend # 63,98% frontend cycles idle
445 567 stalled-cycles-backend # 50,13% backend cycles idle
593 103 instructions # 0,67 insns per cycle
# 0,96 stalled cycles per insn
115 709 branches # 205,246 M/sec
7 699 branch-misses # 6,65% of all branches
10,000933734 seconds time elapsed
So, it says that 0.56 ms of task-clock (a fraction of a jiffy) was used by sleep 10. That is too little to be accounted in the classic rusage, so /usr/bin/time reports 0 jiffies of task CPU usage (user + system):
$ /usr/bin/time sleep 10
0.00user 0.00system 0:10.00elapsed 0%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps
Then perf measures (counts, with the help of the PMU) the real cycles and real instructions executed by the task (and by the kernel on behalf of the task) - the cycles and instructions lines. Sleep used 888k cycles, but only 593k useful instructions were retired, so the mean IPC was 0.6-0.7 (30-40% stalls). Around 300k cycles were lost, and on Sandy Bridge perf reports where they were lost - the stalled-cycles-* events for the frontend (the decoder - the CPU doesn't know what to execute, due to a branch miss or because the code wasn't prefetched into L1I) and for the backend (it can't execute because an instruction needs data from memory which is not available at the right time - a memory stall).
Why do we see more stalls inside the CPU when there should be only 300k cycles without any instruction executed? This is because modern processors are usually superscalar and out-of-order - they can start executing several instructions every CPU clock tick, and even reorder them. If you want to see execution-port utilization, try ocperf (a perf wrapper) from Andi Kleen's pmu-tools and some Intel manuals about their PMU counters. There is also the toplev.py script to "identify the micro-architectural bottleneck for a workload" without selecting Intel events by hand.
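If you want these stall counts from inside a program rather than via perf, the same generic events can be requested with the perf_event_open syscall. A minimal sketch (Linux-specific; PERF_COUNT_HW_STALLED_CYCLES_FRONTEND is only wired up on some CPUs/kernels, so check for errors):
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

// glibc has no wrapper for perf_event_open, so call the syscall directly.
static long perf_open(struct perf_event_attr *attr)
{
    return syscall(__NR_perf_event_open, attr, 0 /* this process */,
                   -1 /* any cpu */, -1 /* no group */, 0);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_STALLED_CYCLES_FRONTEND;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_open(&attr);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    long long stalled = 0;
    read(fd, &stalled, sizeof(stalled));
    printf("frontend stalled cycles: %lld\n", stalled);
    close(fd);
    return 0;
}
Counting PERF_COUNT_HW_CPU_CYCLES and PERF_COUNT_HW_INSTRUCTIONS the same way gives you the IPC that perf stat printed above.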
Related
I am trying to programmatically determine what the idle/running break-down is for the CPU cores.
In Linux, I use /proc/stat for this, which gives me cycle counts for cycles spent in:
user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs
iowait: waiting for I/O to complete
irq: servicing interrupts
softirq: servicing softirqs
Please note: I am getting system-wide numbers, not the cpu-usage for a specific process!
Now, I want to do the same, but in C, for a Windows system.
UPDATE
I've been able to find an aggregate statistic, so not per core: GetSystemTimes(), which will return a value for idle, kernel and user cycles. What really confused me at first is that the kernel cycles include idle cycles.
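For reference, a minimal sketch of that GetSystemTimes() approach (system-wide only, not per core): sample twice, take FILETIME deltas, and remember that the kernel time includes the idle time, exactly as noted above.
#include <stdio.h>
#include <windows.h>

// FILETIME is a 64-bit count of 100-ns units split into two DWORDs.
static unsigned long long ft64(FILETIME ft)
{
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;
}

int main(void)
{
    FILETIME idle1, kern1, user1, idle2, kern2, user2;

    GetSystemTimes(&idle1, &kern1, &user1);
    Sleep(1000);                                            // sampling interval
    GetSystemTimes(&idle2, &kern2, &user2);

    unsigned long long idle = ft64(idle2) - ft64(idle1);
    unsigned long long kern = ft64(kern2) - ft64(kern1);    // includes idle!
    unsigned long long user = ft64(user2) - ft64(user1);
    unsigned long long total = kern + user;

    printf("busy: %.1f%%\n", 100.0 * (double)(total - idle) / (double)total);
    return 0;
}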
I am trying to determine the time needed to read an element, to figure out whether it's a cache hit or a cache miss. For the reads to happen in order I use the _mm_lfence() intrinsic. I got unexpected results, and after checking I saw that lfence's overhead is not deterministic.
So I am executing the program that measures this overhead in a loop of, for example, 100 000 iterations. I get results of more than 1000 clock cycles for one iteration, and the next time it's 200. What can be the reason for such a difference between lfence overheads, and if it is so unreliable, how can I judge the latency of cache hits and cache misses correctly? I was trying to use the same approach as in this post: Memory latency measurement with time stamp counter
The code that gives unreliable results is this:
#include <stdint.h>
#include <x86intrin.h>        // __rdtsc(), _mm_lfence(), _mm_mfence()

int main(void) {
    enum { arr_size = 100000 };
    static uint64_t arr[arr_size];
    uint64_t t1, t2;
    for (int i = 0; i < arr_size; i++) {
        _mm_mfence();
        _mm_lfence();
        t1 = __rdtsc();       // start timestamp
        _mm_lfence();
        _mm_lfence();
        t2 = __rdtsc();       // stop timestamp: measures lfence + rdtsc overhead
        _mm_lfence();
        arr[i] = t2 - t1;     // per-iteration overhead in TSC reference ticks
    }
    return 0;
}
The values in arr vary over quite different ranges; arr_size is 100 000.
I get results of more than 1000 clock cycle for one iteration and next time it's 200.
Sounds like your CPU ramped up from idle to normal clock speed after the first few iterations.
Remember that RDTSC counts reference cycles (at a fixed frequency, equal or close to the max non-turbo frequency of the CPU), not core clock cycles (idle / turbo / whatever). Older CPUs had RDTSC count core clock cycles, but for years now CPU vendors have used a fixed RDTSC frequency, making it useful for clock_gettime(), and advertise this fact with the invariant_tsc CPUID feature bit. See also Get CPU cycle count?
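On GCC/Clang you can check that CPUID bit from C; this is just an illustrative sketch (leaf 0x80000007, EDX bit 8 is the invariant-TSC flag):
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned eax, ebx, ecx, edx;
    // Extended leaf 0x80000007 ("Advanced Power Management"); EDX bit 8 = invariant TSC.
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        puts("invariant TSC: yes");
    else
        puts("invariant TSC: no (or leaf not supported)");
    return 0;
}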
If you really want to use RDTSC instead of performance counters, disable turbo and use a warm-up loop to get your CPU to its max frequency.
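A warm-up loop can be as simple as the sketch below (the spin length is an arbitrary assumption; the only goal is to keep the core busy long enough for the frequency governor to leave its idle state before the real measurements start):
#include <stdint.h>
#include <x86intrin.h>

static void warm_up(void)
{
    uint64_t start = __rdtsc();
    volatile uint64_t sink = 0;
    // Spin for a few hundred million reference ticks (a fraction of a second).
    while (__rdtsc() - start < 300000000ULL)
        sink += 1;
}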
There are libraries that let you program the HW performance counters, and set permissions so you can run rdpmc in user-space. This actually has lower overhead than rdtsc. See What will be the exact code to get count of last level cache misses on Intel Kaby Lake architecture for a summary of ways to access perf counters in user-space.
I also found a paper about adding user-space rdpmc support to Linux perf (PAPI): ftp://ftp.cs.uoregon.edu/pub/malony/ESPT/Papers/espt-paper-1.pdf. IDK if that made it into mainline kernel/perf code or not.
I have been measuring interrupt latency of IRQ5 (part of some custom hardware) on an AMD 586 (500 MHz), which is running Debian Linux 8.2. The results are strange:
90% of the time, the interrupt latency (measured with a scope) is 17-20 microsec
9% of the time, latency is less than 40-50 microsec
1% of the time, latency is less than 100 microsec
0.1% of the time, latency is more than 100 microsec
Worst-case latency is apparently 256 microsec. (This is a suspicious number).
I tried repeating the tests using an 800 MHz AMD 686. Same results.
I tried using Linux Mint. Same results.
I tried setting this interrupt to the highest priority. Same.
I tried turning off all other interrupts. Same.
I tried running in "single" mode. Same.
I tried booting with the "init=/bin/bash" parameter. Same.
The whole thing is kind of maddening.
I suspect that the video subsystem DMA is locking the memory for the amount of time required to perform one (large-ish) block, but I don't have a good way to prove or disprove this theory, much less fix it.
Any ideas or suggestions would be appreciated. (And yes, not using Linux is the default plan at this point.)
I can take a guess based on the names, but what specifically are wall-clock-time, user-cpu-time, and system-cpu-time in Unix?
Is user-cpu time the amount of time spent executing user code, while kernel-cpu time is the amount of time spent in the kernel due to the need for privileged operations (like I/O to disk)?
What unit of time is this measurement in?
And is wall-clock time really the number of seconds the process has spent on the CPU or is the name just misleading?
Wall-clock time is the time that a clock on the wall (or a stopwatch in hand) would measure as having elapsed between the start of the process and 'now'.
The user-cpu time and system-cpu time are pretty much as you said - the amount of time spent in user code and the amount of time spent in kernel code.
The units are seconds (and subseconds, which might be microseconds or nanoseconds).
The wall-clock time is not the number of seconds that the process has spent on the CPU; it is the elapsed time, including time spent waiting for its turn on the CPU (while other processes get to run).
Wall clock time: time elapsed according to the computer's internal clock, which should match time in the outside world. This has nothing to do with CPU usage; it's given for reference.
User CPU time and system time: exactly what you think. System calls, which include I/O calls such as read, write, etc., are executed by jumping into kernel code and executing that.
If wall clock time < CPU time, then you're executing a program in parallel. If wall clock time > CPU time, you're waiting for disk, network or other devices.
All are measured in seconds, per the SI.
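If you want these numbers from inside a program rather than from the shell's time, a minimal POSIX sketch is to combine clock_gettime() for the wall clock with getrusage() for the user and system CPU time:
#include <stdio.h>
#include <time.h>
#include <sys/resource.h>

int main(void)
{
    struct timespec w0, w1;
    struct rusage ru;

    clock_gettime(CLOCK_MONOTONIC, &w0);          // wall clock: start

    volatile double x = 0;                        // some CPU-bound work
    for (long i = 0; i < 50000000L; i++)
        x += i * 0.5;

    clock_gettime(CLOCK_MONOTONIC, &w1);          // wall clock: end
    getrusage(RUSAGE_SELF, &ru);                  // user + system CPU time

    double wall = (w1.tv_sec - w0.tv_sec) + (w1.tv_nsec - w0.tv_nsec) / 1e9;
    double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
    double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;

    printf("real %.3fs  user %.3fs  sys %.3fs\n", wall, user, sys);
    return 0;
}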
time [WHAT-EVER-COMMAND]
real 7m2.444s
user 76m14.607s
sys 2m29.432s
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
real or wall-clock
real 7m2.444s
On a system with a 24-core processor, this cmd/process took more than 7 minutes to complete, and that was while utilizing as much parallelism as possible across all the given cores.
user
user 76m14.607s
The cmd/process utilized this much total CPU time.
In other words, on a machine with a single-core CPU, real and user would be nearly equal, so the same command would take approximately 76 minutes to complete. (Here user/real is roughly 76.2 min / 7.0 min ≈ 11, so the command kept about 11 of the 24 cores busy on average.)
sys
sys 2m29.432s
This is the time taken by the kernel to execute all the basic/system level operations to run this cmd, including context switching, resource allocation, etc.
Note: The example assumes that your command utilizes parallelism/threads.
Detailed man page: https://linux.die.net/man/1/time
Wall clock time is exactly what it says, the time elapsed as measured by the clock on your wall (or wristwatch)
User CPU time is the time spent in "user land", that is, time spent on non-kernel processes.
System CPU time is time spent in the kernel, usually time spent servicing system calls.
Is it possible to determine the throughput of an application on a processor from the cycle counts (processor instruction cycles) consumed by the application? If yes, how do you calculate it?
If the process is entirely CPU bound, then you divide the processor speed by the number of cycles to get the throughput.
In reality, few processes are entirely CPU bound though, in which case you have to take other factors (disk speed, memory speed, serialization, etc.) into account.
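For example, a purely illustrative calculation with made-up numbers (a 3 GHz clock and 1.5 million cycles consumed per request):
#include <stdio.h>

int main(void)
{
    double clock_hz = 3.0e9;            // processor speed, in cycles per second
    double cycles_per_request = 1.5e6;  // cycles consumed per unit of work

    // Throughput = processor speed / cycles per unit of work (CPU-bound case only).
    double requests_per_sec = clock_hz / cycles_per_request;
    printf("~%.0f requests/second if entirely CPU bound\n", requests_per_sec);
    return 0;
}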
Simple:
#include <time.h>
clock_t c;
c = clock(); // c holds the CPU time used so far, in clock ticks
double seconds = (double)c / CLOCKS_PER_SEC; // convert to seconds (the cast avoids integer division)
Note that the value you get is an approximation; for more info, see the clock() man page.
Some CPUs have internal performance registers which enable you to collect all sorts of interesting statistics, such as instruction cycles (sometimes even on a per execution unit basis), cache misses, # of cache/memory reads/writes, etc. You can access these directly, but depending on what CPU and OS you are using there may well be existing tools which manage all the details for you via a GUI. Often a good profiling tool will have support for performance registers and allow you to collect statistics using them.
If you use the Cortex-M3 from TI/Luminary Micro, you can make use of the driverlib delivered by TI/Luminary Micro.
Using the SysTick functions you can set the SysTick period to 1 processor cycle, so you have 1 processor clock between interrupts. By counting the number of interrupts you should get a "near enough estimation" of how much time a function or block of code takes.
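As a rough sketch, assuming the StellarisWare/TivaWare driverlib API (SysTickPeriodSet(), SysTickIntEnable(), SysTickEnable(), SysTickValueGet()): a common variant that avoids taking an interrupt every cycle is to load the full 24-bit period and combine the wrap count with the current counter value.
#include <stdint.h>
#include <stdbool.h>
#include "driverlib/systick.h"        // header path depends on your driverlib version

#define SYSTICK_PERIOD 0x01000000     // maximum 24-bit reload value

static volatile uint32_t g_wraps;     // number of times SysTick reached zero

void SysTickHandler(void)             // hook this into the vector table / startup code
{
    g_wraps++;
}

void cycle_counter_init(void)
{
    g_wraps = 0;
    SysTickPeriodSet(SYSTICK_PERIOD);
    SysTickIntEnable();
    SysTickEnable();
}

// Processor cycles elapsed since cycle_counter_init(). SysTick counts down,
// so the elapsed part of the current period is period minus the current value.
// (Reading g_wraps and the counter is racy right around a wrap; good enough
// for a "near enough estimation".)
uint64_t cycles_elapsed(void)
{
    return (uint64_t)g_wraps * SYSTICK_PERIOD
         + (SYSTICK_PERIOD - SysTickValueGet());
}
Reading cycles_elapsed() before and after a function call then gives an estimate of how many clocks that block took.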