Interrupt Latency using Debian on an AMD x86 - c

I have been measuring interrupt latency of IRQ5 (part of some custom hardware) on an AMD 586 (500 MHz), which is running Debian Linux 8.2. The results are strange:
90% of the time, the interrupt latency (measured with a scope) is 17-20 microsec
9% of the time, latency is less than 40-50 microsec
1% of the time, latency is less than 100 microsec
0.1% of the time, latency is more than 100 microsec
Worst-case latency is apparently 256 microsec. (This is a suspicious number).
I tried repeating the tests using an 800 MHz AMD 686. Same results.
I tried using Linux Mint. Same results.
I tried setting this interrupt to the highest priority. Same.
I tried turning off all other interrupts. Same.
I tried running in "single" mode. Same.
I tried booting with the "init=/bin/bash" parameter. Same.
The whole thing is kind of maddening.
I suspect that the video sub-system DMA is locking the memory for the amount of time required to transfer one (large-ish) block, but I don't have a good way to prove or disprove this theory, much less fix it.
Any ideas or suggestions would be appreciated. (And yes, not using Linux is the default plan at this point.)

Related

How to use perf or other utilities to measure the elapsed time of a loop/function

I am working on a project that requires profiling the target applications first.
What I want to know is the exact time consumed by a loop body/function. The platform is a BeagleBone Black board running Debian, with perf_4.9 installed.
It looks like perf can give cycle statistics and thus be a good fit for my purpose. However, perf can only analyze the whole application rather than individual loops/functions.
I tried the instructions posted in Using perf probe to monitor performance stats during a particular function, but it did not work well.
I am just wondering if there is any example application in C I can test and use on this board for my purpose. Thank you!
Caveat: This is more of a comment than an answer, but it's a bit too long for just a comment.
Thanks a lot for suggesting a new function. I tried it but am a little unsure about its accuracy. Yes, it can offer nanosecond resolution, but there is an inconsistency.
There will be inconsistency if you use two different clock sources.
What I do is first use clock_gettime() to measure a loop body; the approximate elapsed time measured this way is around 1.4 us. Then I put GPIO instructions, pull high and pull low, at the beginning and end of the loop body respectively, and measure the signal frequency on this GPIO with an oscilloscope.
A scope is useful if you're trying to debug the hardware. It can also show what's on the pins. But, in 40+ years of doing performance measurement/improvement/tuning, I've never used it to tune software.
In fact, I would trust the CPU clock more than I would trust the scope for software performance numbers.
For a production product, you may have to measure performance on a system deployed at a customer site [because the issue only shows up on that one customer's machine]. You may have to debug this remotely and can't hook up a scope there. So, you need something that can work without external probe/test rigs.
To my surprise, the frequency is around 1.8MHz, i.e., ~500ns. This inconsistency makes me a little confused... – GeekTao
The difference could be just round off error based on different time bases and latency in getting in/out of the device (GPIO pins). I presume you're just using GPIO in this way to facilitate benchmark/timing. So, in a way, you're not measuring the "real" system, but the system with the GPIO overhead.
In tuning, one is less concerned with absolute values than relative ones. That is, clock_gettime is ultimately based on a count of high-resolution clock ticks (at 1 ns/tick or better, from the system's free-running TSC (time stamp counter)). What the clock frequency actually is doesn't matter as much. If you measure a loop/function and get duration X, then change some code and get X+n, that tells you whether the code got faster or slower.
500ns isn't that large an amount. Almost any system wide action (e.g. timeslicing, syscall, task switch, etc.) could account for that. Unless you've mapped the GPIO registers into app memory, the syscall overhead could dwarf that.
In fact, just the overhead of calling clock_gettime could account for that.
Although clock_gettime is technically a syscall, Linux maps the code directly into the app's address space via the VDSO mechanism, so there is no syscall overhead. But even the userspace code has some calculations to do.
For example, I have two x86 PCs. On one system the overhead of the call is 26 ns; on the other, the overhead is 1800 ns. Both systems run at 2 GHz or faster.
For your beaglebone/arm system, the base clock rate may be less, so overhead of 500 ns may be ballpark.
I usually benchmark the overhead and subtract it out from the calculations.
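As a sketch of that calibrate-and-subtract approach (my own illustration, not from the thread; the busy loop stands in for the real loop body, and the iteration counts are arbitrary):

/* Sketch: calibrate the overhead of clock_gettime() itself and subtract it
 * from a loop measurement. Build: gcc -O2 timing_sketch.c (add -lrt on old glibc) */
#include <stdio.h>
#include <time.h>

static long long ns_diff(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1000000000LL + (b.tv_nsec - a.tv_nsec);
}

int main(void)
{
    struct timespec t0, t1;
    enum { CAL = 1000000, N = 1000000 };

    /* 1. Measure the per-call cost of clock_gettime(). */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < CAL; i++) {
        struct timespec tmp;
        clock_gettime(CLOCK_MONOTONIC, &tmp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double overhead_ns = (double)ns_diff(t0, t1) / CAL;

    /* 2. Measure the code of interest (placeholder busy loop here). */
    volatile unsigned long x = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        x += i;                          /* stand-in for the real loop body */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (double)ns_diff(t0, t1) - overhead_ns;
    printf("clock_gettime overhead: ~%.1f ns/call\n", overhead_ns);
    printf("loop: ~%.2f ns/iteration\n", total_ns / N);
    return 0;
}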
And, on the x86, the actual code just gets the CPU's TSC value (via the rdtsc instruction) and does some adjustment. For arm, it has a similar H/W register but requires special care to map userspace access to it (a coprocessor instruction, IIRC).
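For completeness, reading the x86 TSC directly looks roughly like this (a sketch; __rdtsc() is the GCC/Clang intrinsic from x86intrin.h, and the raw tick count still has to be scaled by the TSC frequency, which is not shown):

/* Sketch: raw cycle counting on x86 via the TSC. */
#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

int main(void)
{
    unsigned long long start = __rdtsc();

    volatile unsigned long x = 0;
    for (int i = 0; i < 1000000; i++)
        x += i;                          /* placeholder work */

    unsigned long long end = __rdtsc();
    printf("elapsed: %llu TSC ticks\n", end - start);
    return 0;
}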
Speaking of arm, I was doing a commercial arm product (an nVidia Jetson to be exact). We were very concerned about latency of incoming video frames.
The H/W engineer didn't trust TSC [or software in general ;-)] and tried to use a scope and an LED [controlled by a GPIO pin], noting when the LED flash/pulse showed up inside the video frame (i.e. the coordinates of the white dot in the video frame were [effectively] a time measurement).
It took a while to convince the engineer, but, eventually I was able to prove that the clock_gettime/TSC approach was more accurate/reliable.
And, certainly, easier to use. We had multiple test/development/SDK boards but could only hook up the scope/LED rig on one at a time.

How can I measure real cpu usage in linux?

I know that there are tools like top and ps for measuring CPU usage, but the way they measure the CPU usage is by measuring how much time the idle task was not running. So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied. However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
tools like top and ps for measuring CPU usage, .. measure the CPU usage is by measuring how much time the idle task was not running.
No, they don't measure idle time; they just read what the kernel thinks about its CPU usage via /proc/stat (try the vmstat 1 tool too). Did you check that the system-wide user + system times are really derived only from idle time? I think the kernel just exports some scheduler statistics, which record user/system state on rescheduling, both on the system timer and on blocking system calls (possibly in one of the callers of cpuacct_charge, like update_curr - "Update the current task's runtime statistics").
/proc/stat example:
cat /proc/stat
cpu 2255 34 2290 22625563 6290 127 456
and decode it with the help of http://www.linuxhowtos.org/System/procstat.htm:
The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing different kinds of work. Time units are in USER_HZ or Jiffies (typically hundredths of a second).
The meanings of the columns are as follows, from left to right:
user: normal processes executing in user mode
nice: niced processes executing in user mode
system: processes executing in kernel mode
idle: twiddling thumbs
When we hear "jiffies", it means the scheduler was used to get the numbers, not an estimate based on the idle task (top doesn't even see this task, or tasks with pid 0).
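If you want to do the same accounting yourself rather than trusting top, a minimal C sketch (mine, not from the original answer; it uses only the first four columns listed above) would be:

/* Sketch: compute CPU busy % from /proc/stat the way top/vmstat do,
 * using only the user, nice, system and idle columns. */
#include <stdio.h>
#include <unistd.h>

static int read_cpu(unsigned long long *busy, unsigned long long *total)
{
    unsigned long long user, nice, system, idle;
    FILE *f = fopen("/proc/stat", "r");
    if (!f) return -1;
    if (fscanf(f, "cpu %llu %llu %llu %llu", &user, &nice, &system, &idle) != 4) {
        fclose(f);
        return -1;
    }
    fclose(f);
    *busy  = user + nice + system;
    *total = user + nice + system + idle;
    return 0;
}

int main(void)
{
    unsigned long long b0, t0, b1, t1;
    if (read_cpu(&b0, &t0)) return 1;
    sleep(1);                                    /* sample interval */
    if (read_cpu(&b1, &t1)) return 1;
    printf("busy: %.1f%%\n", 100.0 * (b1 - b0) / (double)(t1 - t0));
    return 0;
}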
So, for example, even if a CPU has a stall due to a cache miss, these tools would still consider the CPU to be occupied.
And basically (when there is no SMT, like HT in Intel CPUs), the CPU is occupied when your task has a pipeline stall due to a memory access (or takes the wrong path with out-of-order execution). The OS can't run another task, because a task switch is more expensive than waiting out this one stall.
The SMT case is different, because there is hardware that either switches between two logical tasks on a single core or even (in fine-grained SMT) mixes their instructions (micro-operations) into a single stream to execute on shared hardware. There are usually SMT statistics counters to check the actual mixing.
However, what I want is for the profiling tool to consider the CPU as idle during a stall. Is there any tool which does that?
The performance monitoring unit (PMU) may have useful events for this. For example, perf stat reports some of them (here on Sandy Bridge):
$ perf stat /bin/sleep 10

 Performance counter stats for '/bin/sleep 10':

          0,563759 task-clock                #    0,000 CPUs utilized
                 1 context-switches          #    0,002 M/sec
                 0 CPU-migrations            #    0,000 M/sec
               175 page-faults               #    0,310 M/sec
           888 767 cycles                    #    1,577 GHz
           568 603 stalled-cycles-frontend   #   63,98% frontend cycles idle
           445 567 stalled-cycles-backend    #   50,13% backend cycles idle
           593 103 instructions              #    0,67  insns per cycle
                                             #    0,96  stalled cycles per insn
           115 709 branches                  #  205,246 M/sec
             7 699 branch-misses             #    6,65% of all branches

      10,000933734 seconds time elapsed
So it says that about 0.56 ms of CPU time (task-clock) was used by sleep 10. That is too little to be accounted for in classic rusage, and /usr/bin/time reports 0 jiffies of task CPU usage (user + system):
$ /usr/bin/time sleep 10
0.00user 0.00system 0:10.00elapsed 0%CPU (0avgtext+0avgdata 2608maxresident)k
0inputs+0outputs (0major+210minor)pagefaults 0swaps
Then perf measures (counts, with the help of the PMU) the real cycles and real instructions executed by the task (and by the kernel on behalf of the task): the cycles and instructions lines. sleep used 888k cycles but only 593k useful instructions were retired, and the mean IPC was 0.6-0.7 (30-40% stalls). Around 300k cycles were lost; on Sandy Bridge perf reports where they were lost with the stalled-cycles-* events: the frontend (the decoder; the CPU doesn't know what to execute because of a branch miss or because the code wasn't prefetched into L1I) and the backend (an instruction can't execute because it needs data from memory that isn't available at the right time, i.e. a memory stall).
Why do we see more stall cycles inside the CPU when there should be only about 300k cycles with no instruction executed? This is because modern processors are usually superscalar and out-of-order: they can start executing several instructions every CPU clock tick, and even reorder them. If you want to see execution-port utilization, try ocperf (a perf wrapper) from Andi Kleen's pmu-tools and some Intel manuals about their PMU counters. There is also the toplev.py script to "identify the micro-architectural bottleneck for a workload" without selecting Intel events by hand.
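As a follow-up sketch (mine, not part of the original answer): if you want these counts for just one region of code rather than the whole process, you can call the perf_event_open(2) syscall directly. This stays close to the man-page example and counts only CPU cycles around a placeholder loop:

/* Sketch: count CPU cycles for one code region via perf_event_open(2).
 * Linux-specific; adapted from the man-page example. */
#include <linux/perf_event.h>
#include <asm/unistd.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled = 1;                /* start stopped, enable around the region */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);   /* this thread, any CPU */
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 10000000UL; i++)
        x += i;                       /* region of interest (placeholder) */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles))
        perror("read");
    printf("cycles in region: %llu\n", (unsigned long long)cycles);
    close(fd);
    return 0;
}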

How to measure the power consumed by a C algorithm while running on a Pentium 4 processor?

How can I measure the power consumed by a C algorithm while running on a Pentium 4 processor (and any other processor will also do)?
Since you know the execution time, you can calculate the energy used by the CPU by looking up the power consumption in the P4 datasheet. For example, a 2.2 GHz P4 with a 400 MHz FSB has a typical Vcc of 1.3725 volts and Icc of 47.9 amps, which is (1.3725 * 47.9 =) 65.74 watts. Since your loop of 10,000 algorithm runs took 46.428570 s, a single run takes 46.428570/10000 = 0.0046428570 s. The energy consumed by one run of your algorithm would then be 65.74 watts * 0.0046428570 s = 0.305 watt-seconds (or joules).
To convert to kilowatt-hours: 0.305 watt-seconds * (1 kilowatt / 1000 watts) * (1 hour / 3600 seconds) is about 8.5e-8 kWh per run, and the whole 10,000-run loop is about 3052 watt-seconds, or roughly 0.00085 kWh. At a utility rate of about $0.11 per kWh, the entire loop costs on the order of a hundredth of a cent to run.
Keep in mind this is the CPU only...none of the rest of the computer.
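The arithmetic above is easy to put into a few lines of C so you can plug in your own datasheet values and measured runtime; here is a rough sketch (the constants are just the numbers quoted above plus an assumed $0.11/kWh rate):

/* Sketch: back-of-the-envelope CPU energy/cost estimate from datasheet
 * power and measured runtime. Constants are illustrative, not measured here. */
#include <stdio.h>

int main(void)
{
    double vcc_volts  = 1.3725;      /* typical Vcc from the P4 datasheet */
    double icc_amps   = 47.9;        /* typical Icc from the P4 datasheet */
    double loop_secs  = 46.428570;   /* measured time for 10,000 runs */
    double runs       = 10000.0;
    double price_kwh  = 0.11;        /* assumed utility rate, $/kWh */

    double watts          = vcc_volts * icc_amps;        /* ~65.74 W */
    double joules_per_run = watts * (loop_secs / runs);  /* watt-seconds */
    double kwh_total      = watts * loop_secs / 3.6e6;   /* joules -> kWh */

    printf("CPU power:      %.2f W\n", watts);
    printf("energy per run: %.3f J\n", joules_per_run);
    printf("whole loop:     %.6f kWh (~$%.6f)\n", kwh_total, kwh_total * price_kwh);
    return 0;
}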
Run your algorithm in a long loop with a Kill-a-Watt attached to the machine?
Excellent question; I upvoted it. I haven't got a clue, but here's a methodology:
-- get CPU spec sheet from Intel (or AMD or whoever) or see Wikipedia; that should tell you power consumption at max FLOP rate;
-- translate algorithm into FLOPs;
-- do some simple arithmetic;
-- post your data and calculations to SO and invite comments and further data
Of course, you'll have to frame your next post as another question, I'll watch with interest.
Unless you run the code on a simple single-tasking OS such as DOS or an RTOS, where you get precise control of what runs at any time, the OS will typically be running many other processes simultaneously, and it may be difficult to distinguish between your process and the others.
First, you need to be running the simplest OS that supports your code (probably a stripped-down server Unix of some sort; I expect this to be impractical on Windows). That's to avoid the OS messing up your measurements.
Then you need to instrument the box with a sensitive datalogger between the power supply and motherboard. This is going to need some careful hardware engineering so as not to mess up the PCs voltage regulation, but someone must have done it.
I have actually done this with an embedded MIPS box and a logging multimeter, but that had a single 12V power supply. Actually, come to think of it, if you used a power supply built for running a PC in a vehicle, you would have a 12V supply and all you'd need then is a lab PSU with enough amps to run the thing.
It's hard to say.
I would suggest you use a current clamp, so you can measure all the power being consumed by your CPU. Then measure the idle consumption of your system and establish a baseline value with as low a standard deviation as possible.
Then run the critical code in a loop.
The previous suggestions about running your code under DOS/an RTOS are also valid, but the code may not compile the same way as in your production build...
Sorry, I find this question senseless.
Why? Because an algorithm itself has (with the following exceptions*) no correlation with the power consumption; what matters is the priority at which the program/thread/process runs. If you change the priority, you change the amount of idle time the processor has and therefore the power consumption. I think the only difference in energy consumption between instructions is the number of cycles needed, so fast code will be power friendly.
Measuring the power consumption of an "algorithm" is senseless unless what you really mean is its performance.
*Exceptions: Threads which can be idle while waiting for other threads, programs which use the HLT instruction.
Sure, running the processor as fast as possible increases the amount of energy used superlinearly (more heat, more cooling needed), but that is a hardware problem. If you want to save energy, you can downclock the processor or use energy-efficient ones (Atom processors), but changing/tweaking the code won't change anything.
So I think it makes much more sense to ask the processor manufacturer for specifications of the different processor modes that exist and their energy consumption. You also need to know that the peripherals (fan, power supply, graphics card (!)) and the other software running on the system will influence the results of any measurement of the computer's power.
Why do you need this anyway?

When benchmarking what would make the elapsed CPU time LESS than the user CPU time

Following up on a question I posed earlier:
I ended up with a User CPU time and Total CPU time that was about 4% longer in duration than the elapsed real time. Based on the accepted answer to my earlier question, I don't understand how this could be the case. Could anyone explain this?
Multithreaded code on multiple cores can use more than 100% CPU time.
Because if I use two CPUs at 100% for 10 minutes, I've used 20 minutes worth of CPU time (i.e. were one of those CPUs disabled, it would take 20 minutes for my operation to complete)
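A minimal illustration (my own sketch, not from the thread) of how process CPU time can exceed wall-clock time on a multicore machine:

/* Sketch: two busy threads make process CPU time ~2x the wall-clock time.
 * Build: gcc -O2 -pthread cpu_vs_wall.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static void *spin(void *arg)
{
    (void)arg;
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 200000000UL; i++)
        x += i;                        /* busy work on one core */
    return NULL;
}

static double secs(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
    struct timespec w0, w1, c0, c1;
    pthread_t t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &w0);            /* wall clock */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);   /* CPU time of all threads */

    pthread_create(&t1, NULL, spin, NULL);
    pthread_create(&t2, NULL, spin, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    clock_gettime(CLOCK_MONOTONIC, &w1);
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);

    /* On a multicore machine the CPU time is roughly twice the wall time. */
    printf("wall: %.2f s, cpu: %.2f s\n", secs(w0, w1), secs(c0, c1));
    return 0;
}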
One possibility for benchmarks being off by a small margin is incorrect timer resolution.
There are quite a few ways of determining those values (time, ticks, CPU frequency, OS API, etc) so not all benchmark routines are 100% reliable.

CPU clock frequency and thus QueryPerformanceCounter wrong?

I am using QueryPerformanceCounter to time some code. I was shocked when the code started reporting times that were clearly wrong. To convert the results of QPC into "real" time you need to divide by the frequency returned from QueryPerformanceFrequency, so the elapsed time is:
Time = (QPC.end - QPC.start)/QPF
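Put as code, that calculation is just the following (a sketch using the documented Win32 calls, with a placeholder loop as the timed region; error handling omitted):

/* Sketch: timing a code region with QueryPerformanceCounter/Frequency. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);        /* counts per second */
    QueryPerformanceCounter(&start);

    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 10000000UL; i++)
        x += i;                              /* code being timed (placeholder) */

    QueryPerformanceCounter(&end);
    double seconds = (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    printf("elapsed: %.6f s\n", seconds);
    return 0;
}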
After a reboot, the QPF frequency changed from 2.7 GHz to 4.1 GHz. I do not think that the actual hardware frequency changed, as the wall-clock runtime of the program did not change, although the time reported using QPC did change (it dropped by a factor of 2.7/4.1).
MyComputer->Properties shows:
Intel(R) Pentium(R) 4 CPU 2.80 GHz; 4.11 GHz; 1.99 GB of RAM; Physical Address Extension
Other than this, the system seems to be working fine.
I will try a reboot to see if the problem clears, but I am concerned that these critical performance counters could become invalid without warning.
Update:
While I appreciate the answers and especially the links, I do not have one of the affected chipsets, nor do I have a CPU clock that varies by itself. From what I have read, QPC and QPF are based on a timer in the PCI bus and are not affected by changes in the CPU clock. The strange thing in my situation is that the frequency reported by QPF changed to an incorrect value, and this changed frequency was also reported in MyComputer -> Properties, which I certainly did not write.
A reboot fixed my problem (QPF now reports the correct frequency) but I assume that if you are planning on using QPC/QPF you should validate it against another timer before trusting it.
Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMDs may also cause a problem. See the second post by sebbbi, where he states:
QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install the AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).
From this answer.
You should always expect the core frequency to change on any CPU that supports a technology such as SpeedStep or Cool'n'Quiet. Wall time is not affected; it uses the RTC. You should probably stop using the performance counters unless you can tolerate a few (5-50) milliseconds' worth of occasional phase adjustments, and are willing to do some math to perform that phase adjustment by continuously or periodically re-normalizing your performance-counter values based on the reported performance-counter frequency and on the RTC's low-resolution time (you can do this on demand, or asynchronously from a high-resolution timer, depending on your application's ultimate needs).
You can try the Stopwatch class from .NET; it could help with your problem since it abstracts away all this low-level stuff.
Use the IsHighResolution property to see whether the timer is based on a high-resolution performance counter.
Note: On a multiprocessor computer, it does not matter which processor the thread runs on. However, because of bugs in the BIOS or the Hardware Abstraction Layer (HAL), you can get different timing results on different processors. To specify processor affinity for a thread, use the ProcessThread.ProcessorAffinity method.
Just a shot in the dark.
On my home PC I used to have "AI NOS" or something like that enabled in the BIOS. I suspect this screwed up the QueryPerformanceCounter/QueryPerformanceFrequency APIs because although the system clock ran at the normal rate, and normal apps ran perfectly, all full screen 3D games ran about 10-15% too fast, causing, for example, adjacent lines of dialog in a game to trip on each other.
I'm afraid you can't say "I shouldn't have this problem" when you're using QueryPerformance* - while the documentation states that the value returned by QueryPerformanceFrequency is constant, practical experimentation shows that it really isn't.
However you also don't want to be calling QPF every time you call QPC either. In practice we found that periodically (in our case once a second) calling QPF to get a fresh value kept the timers synchronised well enough for reliable profiling.
As has been pointed out as well, you need to keep all of your QPC calls on a single processor for consistent results. While this might not matter for profiling purposes (because you can just use ProcessorAffinity to lock the thread onto a single CPU), you don't want to do this for timing which is running as part of a proper multi-threaded application (because then you run the risk of locking a hard working thread to a CPU which is busy).
Especially don't arbitrarily lock to CPU 0, because you can guarantee that some other badly coded application has done that too, and then both applications will fight over CPU time on CPU 0 while CPU 1 (or 2 or 3) sit idle. Randomly choose from the set of available CPUs and you have at least a fighting chance that you're not locked to an overloaded CPU.

Resources