Real time Linux: disable local timer interrupts

Real time Linux: disable local timer interrupts - timer

TL;DR : Using Linux kernel real time with NO_HZ_FULL I need to isolate a process in order to have deterministic results but /proc/interrupts tell me there is still local timer interrupts (among other). How to disable it?
Long version :
I want to make sure my program is not being interrupt so I try to use a real time Linux kernel.
I'm using the real time version of arch Linux (linux-rt on AUR) and I modified the configuration of the kernel to selection the following options :
CONFIG_NO_HZ_FULL=y
CONFIG_NO_HZ_FULL_ALL=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y
then I reboot my computer to boot on this real time kernel with the folowing options:
nmi_watchdog=0
rcu_nocbs=1
nohz_full=1
isolcpus=1
I also disable the following option in the BIOS :
C state
intel speed step
turbo mode
VTx
VTd
hyperthreading
My CPU (i7-6700 3.40GHz) has 4 cores (8 logical CPU with hyperthreading technology)
I can see CPU0, CPU1, CPU2, CPU3 in /proc/interrupts file.
CPU1 is isolated by isolcpus kernel parameter and I want to disable the local timer interrupts on this CPU.
I though real-time kernel with CONFIG_NO_HZ_FULL and CPU isolation (isolcpus) was enough to do it and I try to check by running theses command :
cat /proc/interrupts | grep LOC > ~/tmp/log/overload_cpu1
taskset -c 1 ./overload
cat /proc/interrupts | grep LOC >> ~/tmp/log/overload_cpu1
where the overload process is:
***overload.c:***
int main()
{
for(int i=0;i<100;++i)
for(int j=0;j<100000000;++j);
}
The file overload_cpu1 contains the result:
LOC: 234328 488 12091 11299 Local timer interrupts
LOC: 239072 651 12215 11323 Local timer interrupts
meanings 651-488 = 163 interrupts from local timer and not 0...
For comparison I do the same experiment but I change the core where my process overload run (I keep watching interrupts on CPU1):
taskset -c 0 : 8 interrupts
taskset -c 1 : 163 interrupts
taskset -c 2 : 7 interrupts
taskset -c 3 : 8 interrupts
One of my question is why there is no 0 interrupts ? why the number of interrupts is bigger when my process run on CPU1 ? (I mean I though NO_HZ_FULL will prevent interrupt if my process was alone : "The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid
sending scheduling-clock interrupts to CPUs with a single runnable task"(https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt)
Maybe an explaination is there is other process running on CPU1.
I checked by using ps command :
CLS CPUID RTPRIO PRI NI CMD PID
TS 1 - 19 0 [cpuhp/1] 18
FF 1 99 139 - [migration/1] 20
TS 1 - 19 0 [rcuc/1] 21
FF 1 1 41 - [ktimersoftd/1] 22
TS 1 - 19 0 [ksoftirqd/1] 23
TS 1 - 19 0 [kworker/1:0] 24
TS 1 - 39 -20 [kworker/1:0H] 25
FF 1 1 41 - [posixcputmr/1] 28
TS 1 - 19 0 [kworker/1:1] 247
TS 1 - 39 -20 [kworker/1:1H] 501
As you can see, there is threads on the CPU1.
Is that possible to disable these processes ? I guess it is because if it is not the case, NO_HZ_FULL will never work right ?
Tasks with class TS doesn't disturb me because they didn't have priority among SCHED_FIFO and I can set this policy to my program.
Same things for tasks with class FF and priority less than 99.
However, you can see migration/1 that is in SCHED_FIFO and priority 99.
Maybe these process can causes interrupts when they run . This explain the few interrupts when my process in on CPU0, CPU2 and CPU3 (respectively 8,7 and 8 interrupts) but it also mean these processes are not running very often and then doesn't explain why there is many interrupts when my process run on CPU1 (163 interrupts).
I also do the same experiment but with the SCHED_FIFO of my overload process and I get:
taskset -c 0 : 1
taskset -c 1 : 4063
taskset -c 2 : 1
taskset -c 3 : 0
In this configuration there is more interrupts in the case my process use SCHED_FIFO policy on CPU1 and less on other CPU. do you know why ?

The thing is that a full-tickless CPU (a.k.a. adaptive-ticks, configured with nohz_full=) still receives some ticks.
Most notably the scheduler requires a timer on an isolated full tickless CPU for updating some state every second or so.
This is a documented limitation (as of 2019):
Some process-handling operations still require the occasional
scheduling-clock tick. These operations include calculating CPU
load, maintaining sched average, computing CFS entity vruntime,
computing avenrun, and carrying out load balancing. They are
currently accommodated by scheduling-clock tick every second
or so. On-going work will eliminate the need even for these
infrequent scheduling-clock ticks.
(source: Documentation/timers/NO_HZ.txt, cf. the LWN article (Nearly) full tickless operation in 3.10 from 2013 for some background)
A more accurate method to measure the local timer interrupts (LOC row in /proc/interrupts) is to use perf. For example:
$ perf stat -a -A -e irq_vectors:local_timer_entry ./my_binary
Where my_binary has threads pinned to the isolated CPUs that non-stop utilize the CPU without invoking syscalls - for - say 2 minutes.
There are other sources of additional local timer ticks (when there is just 1 runnable task).
For example, the collection of VM stats - by default they are collected each seconds. Thus, I can decrease my LOC interrupts by setting a higher value, e.g.:
# sysctl vm.stat_interval=60
Another source are periodic checks if the TSC on the different CPUs doesn't drift - you can disable those with the following kernel option:
tsc=reliable
(Only apply this option if you really know that your TSCs don't drift.)
You might find other sources by recording traces with ftrace (while your test binary is running).
Since it came up in the comments: Yes, the SMI is fully transparent to the kernel. It doesn't show up as NMI. You can only detect an SMI indirectly.

Related

Looking for cause of unexpected preemption in linux kernel module

I have a small linux kernel module that is a prototype for a device driver for hardware that doesn't exist yet. The code needs to do a short bit of computation as fast as possible from beginning to end with a duration that is a few microseconds. I am trying to measure whether this is possible with the intel rdtscp instruction using an ndelay() call to simulate the computation. I find that 99.9% of the time it runs as expected, but 0.1% of the time it has a very large delay that appears as if something else is preempting the code despite running inside a spinlock which should be disabling interrupts. This is run using a stock Ubuntu 64 bit kernel (4.4.0-112) with no extra realtime or low latency patches.
Here is some example code that replicates this behavior. This is written as a handler for a /proc filesystem entry for easy testing, but I have only shown the function that actually computes the delays:
#define ITERATIONS 50000
#define SKIPITER 10
DEFINE_SPINLOCK(timer_lock);
static int timing_test_show(struct seq_file *m, void *v)
{
uint64_t i;
uint64_t first, start, stop, delta, max=0, min=1000000;
uint64_t avg_ticks;
uint32_t a, d, c;
unsigned long flags;
int above30k=0;
__asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
first = a | (((uint64_t)d)<<32);
for (i=0; i<ITERATIONS; i++) {
spin_lock_irqsave(&timer_lock, flags);
__asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
start = a | (((uint64_t)d)<<32);
ndelay(1000);
__asm__ volatile ("rdtscp" : "=a" (a), "=d" (d) : : "rcx");
stop = a | (((uint64_t)d)<<32);
spin_unlock_irqrestore(&timer_lock, flags);
if (i < SKIPITER) continue;
delta = stop-start;
if (delta < min) min = delta;
if (delta > max) max = delta;
if (delta > 30000) above30k++;
}
seq_printf(m, "min: %llu max: %llu above30k: %d\n", min, max, above30k);
avg_ticks = (stop - first) / ITERATIONS;
seq_printf(m, "Average total ticks/iteration: %llu\n", avg_ticks);
return 0;
}
Then if I run:
# cat /proc/timing_test
min: 4176 max: 58248 above30k: 56
Average total ticks/iteration: 4365
This is on a 3.4 GHz sandy bridge generation Core i7. The ~4200 ticks of the TSC is about right for a little over 1 microsecond delay. About 0.1% of the time I see delays about 10x longer than expected, and in some cases I have seen times as long as 120,000 ticks.
These delays appear too long to be a single cache miss, even to DRAM. So I think it either has to be several cache misses, or another task preempting the CPU in the middle of my critical section. I would like to understand the possible causes of this to see if they are something we can eliminate or if we have to move to a custom processor/FPGA solution.
Things I have tried:
I considered if this could be caused by cache misses. I don't think that could be the case since I ignore the first few iterations which should load the cache. I have verified by examining disassembly that there are no memory operations between the two calls to rdtscp, so I think the only possible cache misses are for the instruction cache.
Just in case, I moved the spin_lock calls around the outer loop. Then it shouldn't be possible to have any cache misses after the first iteration. However, this made the problem worse.
I had heard that the SMM interrupt is unmaskable and mostly transparent and could cause unwanted preemption. However, you can read the SMI interrupt count with rdmsr on MSR_SMI_COUNT. I tried adding that before and after and there are no SMM interrupts happening while my code is executing.
I understand there are also inter-processor interrupts in SMP systems that may interrupt, but I looked at /proc/interrupts before and after and don't see enough of them to explain this behavior.
I don't know if ndelay() takes into account variable clock speed, but I think the CPU clock only varies by a factor of 2, so this should not cause a >10x change.
I booted with nopti to disable page table isolation in case that is causing problems.

Another thing that I have just noticed is that it is unclear what ndelay() does. Maybe you should show it so as non-trivial problems may be lurking inside it.
For example, I've observed once that my piece of a kernel driver code was still preempted when it had a memory leak inside it, so as soon as it hit some watermark limit, it was put aside even if it disabled interrupts.

120,000 ticks that you observed in extreme cases sounds a lot like an SMM handler. Lesser values might have been caused by an assortment of microarchitectural events (by the way, have you checked all the performance counters available to you?), but this is something that must be caused by a subroutine written by someone who was not writing his/her code to achieve minimal latency.
However you stated that you've checked that no SMIs are observed. This leads me to think that either something is wrong with kernel facilities to count/report them, or with your method to look after them. Hunting after SMI without a hardware debugger may be a frustrating endeavor.
Was SMI_COUNT not changing during your experiment course, or was it exactly zero all the time? The latter might indicate that it does not count anything, unless you system is completely free from SMI, which I doubt of in case of regular Sandy Bridge.
It may be that SMIs are delivered to another core in your system, and an SMM handler is synchronizing other cores through some sort of mechanism that does not show up on SMI_COUNT. Have you checked other cores?
In general I would recommend starting downsizing your system under test to exclude as much of stuff as possible. Have you tried booting it with a single core and no hyperthreading enabled in BIOS? Have you tried to run the same code on a system that is known to not have SMIs? The same goes with disabling Turbo Boost and Frequency scaling in BIOS. As much as possible of timing-related must go.

FYI, in my system:
timingtest % uname -a
Linux xxxxxx 4.15.0-42-generic #45-Ubuntu SMP Thu Nov 15 19:32:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Replicating your example (with ndelay(1000);) I get:
timingtest % sudo cat /proc/timing_test
min: 3783 max: 66883 above30k: 20
Average total ticks/iteration: 4005
timingtest % sudo cat /proc/timing_test
min: 3783 max: 64282 above30k: 19
Average total ticks/iteration: 4010
Replicating your example (with udelay(1);) I get:
timingtest % sudo cat /proc/timing_test
min: 3308 max: 43301 above30k: 2
Average total ticks/iteration: 3611
timingtest % sudo cat /proc/timing_test
min: 3303 max: 44244 above30k: 2
Average total ticks/iteration: 3600
ndelay(),udelay(),mdelay() are for use in atomic context as stated here:
https://www.kernel.org/doc/Documentation/timers/timers-howto.txt
They all rely on __const_udelay() funtion that is a vmlinux exported symbol (using: LFENCE/RDTSC instructions).
Anyway, I replaced the delay with:
for (delta=0,c=0; delta<500; delta++) {c++; c|=(c<<24); c&=~(c<<16);}
for a trivial busy loop, with the same results.
I also tryed with _cli()/_sti(), local_bh_disable()/local_bh_enable() and preempt_disable()/preempt_enable() without success.
Examinig SMM interrupts (before and after delay) with:
__asm__ volatile ("rdmsr" : "=a" (a), "=d" (d) : "c"(0x34) : );
smi_after = (a | (((uint64_t)d)<<32));
I always obtain the same number (no SMI or register not updated).
Executing the cat command with trace-cmd to explore what's happening, I get results suprissingly not so scattered in time. (!?)
timingtest % sudo trace-cmd record -o trace.dat -p function_graph cat /proc/timing_test
plugin 'function_graph'
min: 3559 max: 4161 above30k: 0
Average total ticks/iteration: 5863
...
In my system, the problem can be solved making use of Power management Quality of Service, see (https://access.redhat.com/articles/65410). Hope this helps

Why sys time of process is showing higher when the strace command is showing significantly less time?

I have written a program using C which has two threads.
Initially it was,
for(int i=0;i<n;i++){
long_operation(arr[i]);
}
then I divided the loop into two threads, two execute concurrently.
One thread will carry out the operation for arr[0] to arr[n/2], another thread will work for arr[n/2] to arr[n-1].
long_operation function is thread safe.
Initially I was using join but it was taking higher sys time for futex system call, which I observed using strace command.
So i removed the strace command and use two volatile variable in the two threads to keep track whether thread is completed or not and a busy loop in the thread spawing function two halt the execution of later code. And I made the thread detachable and remove join.
It improved performance a little bit. but when i used time command, the sys part was taking,
real 0m31.368s
user 0m53.738s
sys 0m15.203s
but when i checked using the strace command the output was,
% time seconds usecs/call calls errors syscall
55.79 0.000602 9 66 clone
44.21 0.000477 3 177 write
------ ----------- ----------- --------- --------- ---------------
100.00 0.001079 243 total
So the time command was showing that around 15 seconds CPU spend in kernel within the process. But the strace command showing almost 0 seconds was utilized for system calls.
Then why 15 seconds was wasted in kernel?
I have an dual-core hyper-threaded Intel CPU.

How to use irq0 and the PIT in x86 Linux to run a callback a given frequency?

As far as I remember PIT (timer whose irq is 0) emits interrupt 16 or 18 times per second. This frequency (16 or 18 Hz) is just what I need for my application (it should emulate some physical device). Also, as far as I know irq 0 is used for task scheduler and it is triggered much more frequently than 18 Hz.
So, my question is: which is right? 18Hz or much more frequent? Another question is: is it ok to set my own irq 0 handler to be called after task scheduler (set handler with request_irq function)?

What is triggering an 0x08 interrupt?

I'm trying to hijack the Timer interrupt. A colleague told me that interrupt 0x08 on the IDT (Interrupt Descriptor Table) is the timer. Of curse I checked and saw two possible answers: this which says that 8 is the real clock timer and this saying it's the Double Fault interrupt - I decided to believe him and not waste time on checking further. After finally having control over the IDT and replacing interrupt 8, nothing is happening.
So what is going on?
Did this interrupt change its purpose over time from timer to double fault?
Does this interrupt has different purposes on ARM/Intel/etc.?
My code is a kernel module, that hijacks the interrupt 8 and simply do a printk command every time the interrupt arrives. I ran it for about 25 minutes - No output in dmesg.
In case it matters: I run Linux Mint with kernel 3.8 on a VM. The host has an Intel i5.

You can find which interrupt is for timer by using this command: cat /proc/interrupt
Following is a sample output on a 6 core machine:
cat /proc/interrupts | egrep "timer|rtc"
0: 320745126 0 0 0 0 0 IO-APIC-edge timer
8: 1 0 0 0 0 0 IO-APIC-edge rtc0
LOC: 115447297 304097630 194770704 212244137 63864376 69243268 Local timer interrupts
Note, timer and rtc are different. Also there is only one rtc interrupt so far. (Lots of timer interrupts). Following is the uptime output.
uptime
14:14:20 up 13 days, 3:58, 9 users, load average: 0.47, 1.68, 1.38
I think you should check this before you hack IDT. Also, probably, you want to hack interrupt 0, not 8.

You have found two descriptions for the same IRQ because in protected mode the address range 0x0 - 0x1F are reserved for internal cpu interruptions use.
You have to remap IRQs to another address space without conflicts, in this article you can find it explained with all the source code needed:
https://alfaexploit.com/readArticle/416

Way to measure time of execution program

I have a lots of short programs in C. Each program realize simple operation for example: include library, load something (ex matrix) from file, do simple operation, write matrix to file end.
I want to measure real time of excecution a whole program (not only fragment of code).
My simple idea is using htop or ps aux -> column time. But this method isn't good because I don't have exacly time of execution but time of excecution during last refresh and I can miss this.
Do you have any method to measure time of process in linux?

If your program is named foo, then simply typing
~$ time foo
should do exactly what you want.

In addition to other answers, mostly suggesting to use the time utility or shell builtins:
time(7) is a very useful page to read.
You might use (inside your code) the clock(3) standard function to get CPU time in microseconds.
Resolution and accuracy of time measures depends upon hardware and operating system kernel. You could prefer a "real-time" kernel (e.g. a linux-image-3.2.0-rt package), or at least a kernel configured with CONFIG_HZ_1000) to get more precise time measures.
You might also use (inside your code) the clock_gettime(2) syscall (so link also the -lrt library).
When doing measurements, try to have your measured process run a few seconds at least, and measure it several times (because e.g. of disk cache issues).

If you use
time <PROGRAM> [ARGS]
this will provide some base-level information. This should run your shell's time command. Example output:
$ time sleep 2
real 0m2.002s
user 0m0.000s
sys 0m0.000s
But there is also
/usr/bin/time <PROGRAM> [ARGS]
which is more flexible and provides considerably more diagnostic information regarding timing. This runs a GNU timing program. This site has some usage examples. Example output:
$ /usr/bin/time -v sleep 2
Command being timed: "sleep 2"
User time (seconds): 0.00
System time (seconds): 0.00
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:02.00
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2496
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 202
Voluntary context switches: 2
Involuntary context switches: 1
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight