Linux soft lockup on multi-core with a realtime task (C)

I'm not sure whether this is a Linux kernel bug; I have searched many documents and could not find any hint.
I'm asking this question to check whether anyone has met a similar issue and how to solve it.
Environment:
Linux Kernel: 2.6.34.10
CPU: MIPS 64 (total 8 cores)
Application: running in user space
The application has strict response-time requirements, so its threads were set to SCHED_FIFO and some key threads were pinned to dedicated CPU cores; everything was fine with that setup. Later someone found that short CPU peaks (e.g. 60%-80%) sometimes occurred on some cores. To solve this, CPU 0 and CPU 7 were kept for native Linux tasks and CPUs 1-6 were isolated for our applications by adding "isolcpus=1-6" to the boot line. The CPU peaks were gone, but this led to the following issue.
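For reference, the per-thread setup described above presumably looks something like the following sketch; the thread handle, priority, and core number are illustrative, not taken from the actual application.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one core and give it a fixed realtime priority.
 * 'core' and 'prio' are example values, not the application's. */
static int make_rt_and_pin(pthread_t thread, int core, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    cpu_set_t set;
    int err;

    err = pthread_setschedparam(thread, SCHED_FIFO, &sp);
    if (err != 0)
        return err;

    CPU_ZERO(&set);
    CPU_SET(core, &set);   /* e.g. one of the isolated cores 1-6 */
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}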
After running for some time the system hangs and the following message is printed on the console, not always but sporadically (it can happen on multiple CPU cores):
BUG: soft lockup - CPU#4 stuck for 61s! [swapper:0]
Modules linked in: hdml softdog cmt cmm pio clock linux_kernel_bde linux_uk_proxy linux_bcm_core mpt2sas
Cpu 4
$ 0 : 0000000000000000 ffffffffc3600020 ffffffffc1000b00 c0000001006f0010
$ 4 : 0000000000000001 0000000000000001 000000005410f8e0 ffffffffbfff00fe
$ 8 : 000000000000001e ffffffffc15b3c80 0000000000000002 0d0d0d0d0d0d0d0d
$12 : 0000000000000000 000000004000f800 0000000000000000 c000000100768000
$16 : ffffffffc36108e0 0000000000000010 ffffffffc35f0000 0000000000000000
$20 : 0000000000000000 0000000000000000 0000000000000000 0000000000000000
$24 : 0000000000000007 ffffffffc103b3a0
$28 : c0000001006f0000 c0000001006f3e38 0000000000000000 ffffffffc103d774
Hi : 0000000000000000
Lo : 003d0980b38a5000
epc : ffffffffc1000b20 r4k_wait+0x20/0x40
Not tainted
ra : ffffffffc103d774 cpu_idle+0xbc/0xc8
Status: 5410f8e3 KX SX UX KERNEL EXL IE
Cause : 40808000
Looking at the backtrace, the thread was always pending in a condition-variable wait. The pseudo code of the wait/signal functions is as follows:
int xxx_ipc_wait(int target)
{
    struct timespec to;
    int ret;
    .... /* other code */

    /* wait at most 1 ms for a signal on this queue */
    clock_gettime(CLOCK_MONOTONIC, &to);
    timespec_add_ns(&to, 1000000);

    pthread_mutex_lock(&ipc_queue_mutex[target]);
    ret = pthread_cond_timedwait(&ipc_queue_cond[target], &ipc_queue_mutex[target], &to);
    pthread_mutex_unlock(&ipc_queue_mutex[target]);

    return ret;
}

void xxx_ipc_signal_atonce(int target)
{
    ...
    pthread_mutex_lock(&ipc_queue_mutex[target]);
    pthread_cond_signal(&ipc_queue_cond[target]);
    pthread_mutex_unlock(&ipc_queue_mutex[target]);
}
Those waits should wake up in any case because the wait has a timeout. I even created a dedicated Linux thread to signal those condition variables periodically, e.g. every 5 seconds, but the issue was still there.
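One detail worth double-checking in this pattern (it may or may not apply to the code not shown): by default pthread_cond_timedwait() measures the absolute timeout against CLOCK_REALTIME, so filling the timespec from CLOCK_MONOTONIC only gives the intended 1 ms timeout if the condition variable was created with its clock set to CLOCK_MONOTONIC. A minimal sketch of that initialization:

#include <pthread.h>
#include <time.h>

pthread_cond_t  ipc_cond;
pthread_mutex_t ipc_mutex = PTHREAD_MUTEX_INITIALIZER;

static void ipc_cond_init_monotonic(void)
{
    pthread_condattr_t attr;

    pthread_condattr_init(&attr);
    /* Make pthread_cond_timedwait() interpret the abstime
     * against CLOCK_MONOTONIC instead of CLOCK_REALTIME. */
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);
    pthread_cond_init(&ipc_cond, &attr);
    pthread_condattr_destroy(&attr);
}

With the default CLOCK_REALTIME clock, a timespec built from CLOCK_MONOTONIC can lie far in the past or future, so the timeout no longer means "1 ms from now".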
I checked the kernel log with "dmesg" and didn't find anything of value. When I enabled kernel debugging and checked /proc/sched_debug, there was strange information such as the following:
cpu#1 /* it is a normal CPU core */
.nr_running : 1
.load : 0
.nr_switches : 1892378
.nr_load_updates : 167378
.nr_uninterruptible : 0
.next_balance : 4295.060682
.curr->pid : 235 /* it points to the runnable task */
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
R aaTask 235 0.000000 157 49 0 0
cpu#4
.nr_running : 1 /* okay */
.load : 0
.nr_switches : 2120455 /* this value changes from time to time */
.nr_load_updates : 185729
.nr_uninterruptible : 0
.next_balance : 4295.076207
.curr->pid : 0 /* why is this ZERO when there is a runnable task? */
.clock : 746624.000000
.cpu_load[0] : 0
.cpu_load[1] : 0
.cpu_load[2] : 0
.cpu_load[3] : 0
.cpu_load[4] : 0
cfs_rq[4]:/
.exec_clock : 0.000000
.MIN_vruntime : 0.000001
.min_vruntime : 14.951424
.max_vruntime : 0.000001
.spread : 0.000000
.spread0 : -6833.777140
.nr_running : 0
.load : 0
.nr_spread_over : 0
.shares : 0
rt_rq[4]:/
.rt_nr_running : 1
.rt_throttled : 1
.rt_time : 900.000000
.rt_runtime : 897.915785
runnable tasks:
task PID tree-key switches prio exec-runtime sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
bbbb_appl 299 6.664495 1059441 49 0 0 0.000000 0.000000 0.000000 /
I don't know why Linux behaves like this. In the end I changed the task policy from SCHED_FIFO to SCHED_OTHER, and the issue has not happened after months of running. Since the CPU cores are isolated, the system's behavior is similar under SCHED_FIFO and SCHED_OTHER, and SCHED_OTHER is more widely used anyway.

An application waiting on a condition/mutex forever can be a sign of priority inversion, unless it uses synchronization primitives with priority inheritance enabled.
Under FIFO realtime scheduling a thread keeps the CPU until it voluntarily gives it up, which is quite different from the preemptive multitasking most software is written for.
Unless your software explicitly lists realtime FIFO among its requirements, I would not spend time on it and would rather stick with RR and/or CPU pinning/isolation.
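If priority inversion is the suspicion, the mutexes guarding those condition variables can be created with the priority-inheritance protocol (assuming the C library and kernel on the MIPS target support PI futexes). A minimal sketch:

#include <pthread.h>

pthread_mutex_t ipc_mutex;

static int ipc_mutex_init_pi(void)
{
    pthread_mutexattr_t attr;
    int err;

    pthread_mutexattr_init(&attr);
    /* Boost the priority of a low-priority holder while a
     * higher-priority thread is blocked on this mutex. */
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(&ipc_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}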


Shortest path in Answer Set Programming

I'm trying to find all the shortest paths from one source node to every destination node (so 1-3, 1-5, 1-4), together with the cost of each shortest path.
I've tried this code:
node(1..5).
edge(1,2,1).
edge(2,3,9).
edge(3,4,4).
edge(4,1,4).
edge(1,3,1).
edge(3,5,7).
start(1).
end(3).
end(4).
end(5).
0{selected(X,Y)}1:-edge(X,Y,W).
path(X,Y):-selected(X,Y).
path(X,Z):-path(X,Y),path(Y,Z).
:-start(X),end(Y),not path(X,Y).
cost(C):-C=#sum{W,X,Y:edge(X,Y,W),selected(X,Y)}.
#minimize{C:cost(C)}.
#show selected/2.
but my code returns this answer:
clingo version 5.6.0 (c0a2cf99)
Reading from stdin
Solving...
Answer: 1
selected(3,4) selected(1,3) selected(3,5)
Optimization: 12
OPTIMUM FOUND

Models : 1
Optimum : yes
Optimization : 12
Calls : 1
Time : 0.043s (Solving: 0.00s 1st Model: 0.00s Unsat: 0.00s)
CPU Time : 0.000s
What is wrong? How can I enumerate all shortest paths with relative costs?
Surely one error is that you are aggregating all the costs into a single C while, if I have understood correctly, you need a distinct cost for each ending node.
There may also be other errors, but I can't tell exactly what you mean with that program.
I would write it as follows:
node(1..5) .
edge(1,2,1) .
edge(2,3,9) .
edge(3,4,4) .
edge(4,1,4) .
edge(1,3,1) .
edge(3,5,7) .
start(1) .
end(3) .
end(4) .
end(5) .
% For each destination E, some outgoing edge from the start node should be selected
:- start(S), end(E), not selected(S,_,E) .
% No edge pointing to the start node should be selected
:- start(S), selected(_,S,_) .
% If an edge points to the end node, then it may be (or not be) selected for reaching it
0{selected(X,E,E)}1 :- edge(X,E,_), end(E) .
% If an outgoing edge from Y has been selected for reaching E, then an incoming edge may be (or not be) selected for reaching E
0{selected(X,Y,E)}1 :- edge(X,Y,_), selected(Y,_,E) .
% Compute the cost for reaching E
cost(E,C) :- C=#sum{W : edge(X,Y,W), selected(X,Y,E)}, end(E) .
#minimize{C : cost(E,C)} .
#show selected/3 .
#show cost/2 .
The execution of the above program is as follows:
clingo version 5.3.0
Reading from test.lp
Solving...
Answer: 1
selected(3,5,5) selected(1,3,3) selected(3,4,4) selected(1,3,4) selected(1,3,5) cost(3,1) cost(4,5) cost(5,8)
Optimization: 14
OPTIMUM FOUND
Models : 1
Optimum : yes
Optimization : 14
Calls : 1
Time : 0.017s (Solving: 0.00s 1st Model: 0.00s Unsat: 0.00s)
CPU Time : 0.000s
where:
an atom selected(X,Y,Z) indicates that the edge (X,Y) has been selected for reaching the node Z;
an atom cost(E,C) indicates that the minimum cost for reaching the end node E is C.
The starting node is implicit since it is unique.

What do the values in 'aplay --dump-hw-params' represent?

I am having problems configuring ALSA on my RHEL 7.5 machine.
Part of my solution is to attempt to change settings in /etc/asound.conf. I have tried numerous permutations, but I continue to hear "jitter" in my sounds (.raw files).
I am using 'aplay --dump-hw-params' to get the parameters of my sound hardware.
Using this command:
aplay --dump-hw-params Front_Center.wav
These are the results I get:
Playing WAVE 'Front_Center.wav' : Signed 16 bit Little Endian, Rate 48000 Hz, Mono
HW Params of device "default":
--------------------
ACCESS: MMAP_INTERLEAVED MMAP_NONINTERLEAVED MMAP_COMPLEX RW_INTERLEAVED RW_NONINTERLEAVED
FORMAT: S8 U8 S16_LE S16_BE U16_LE U16_BE S24_LE S24_BE U24_LE U24_BE S32_LE S32_BE U32_LE U32_BE FLOAT_LE FLOAT_BE FLOAT64_LE FLOAT64_BE MU_LAW A_LAW IMA_ADPCM S24_3LE S24_3BE U24_3LE U24_3BE S20_3LE S20_3BE U20_3LE U20_3BE S18_3LE S18_3BE U18_3LE U18_3BE
SUBFORMAT: STD
SAMPLE_BITS: [4 64]
FRAME_BITS: [4 640000]
CHANNELS: [1 10000]
RATE: [4000 4294967295)
PERIOD_TIME: (11609 11610)
PERIOD_SIZE: (46 49864571)
PERIOD_BYTES: (23 4294967295)
PERIODS: (0 17344165)
BUFFER_TIME: [1 4294967295]
BUFFER_SIZE: [92 797831566]
BUFFER_BYTES: [46 4294967295]
TICK_TIME: ALL
--------------------
I'd like to know what the values within parentheses and brackets mean in general.
Are they ranges?
What is the difference between the use of parentheses vs. brackets?
Thanks,
Ian
They are the minimum and maximum values supported by the specific hardware device you are using. Square brackets denote inclusive (closed) bounds and parentheses denote exclusive (open) bounds, following the usual interval notation.
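The same ranges can also be queried programmatically through alsa-lib, which can help when narrowing down /etc/asound.conf settings. A small sketch, assuming the "default" device and linking with -lasound:

#include <stdio.h>
#include <alsa/asoundlib.h>

int main(void)
{
    snd_pcm_t *pcm;
    snd_pcm_hw_params_t *hw;
    unsigned int rate_min, rate_max, ch_min, ch_max;
    int dir;

    if (snd_pcm_open(&pcm, "default", SND_PCM_STREAM_PLAYBACK, 0) < 0)
        return 1;

    snd_pcm_hw_params_alloca(&hw);
    snd_pcm_hw_params_any(pcm, hw);   /* full configuration space, as in the dump */

    snd_pcm_hw_params_get_rate_min(hw, &rate_min, &dir);
    snd_pcm_hw_params_get_rate_max(hw, &rate_max, &dir);
    snd_pcm_hw_params_get_channels_min(hw, &ch_min);
    snd_pcm_hw_params_get_channels_max(hw, &ch_max);

    printf("RATE: %u..%u  CHANNELS: %u..%u\n", rate_min, rate_max, ch_min, ch_max);

    snd_pcm_close(pcm);
    return 0;
}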

ALSA unexpected underrun with seemingly correct timings

I am going mad about spurious underrun errors on snd_pcm_writei() calls.
I use a blocking setup:
snd_pcm_open(&handle, "default", SND_PCM_STREAM_PLAYBACK, 0);
Here is the snd_pcm_dump_sw_setup() output:
tstamp_mode : NONE
tstamp_type : MONOTONIC
period_step : 1
avail_min : 1764
period_event : 0
start_threshold : 1
stop_threshold : 3528
silence_threshold: 0
silence_size : 0
boundary : 1849688064
While setting the hardware parameters, I log the results of the snd_pcm_hw_params_set_rate_near(), snd_pcm_hw_params_set_periods_near(), snd_pcm_hw_params_set_period_size_near() calls:
3719.1287 D [AlsaSound] SOUND: setupWithFreq sampling rate: 44100, dir: 0
3719.1288 D [AlsaSound] SOUND: number of periods: 2, dir: 0
3719.1289 D [AlsaSound] SOUND: period size: 1764 frames, dir: 0
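For context, the calls that produce those three logged values presumably follow the usual hw-params sequence, roughly like the sketch below; error handling, the application's own logging wrappers, and the channel count and sample format are assumptions, not taken from the code shown.

snd_pcm_hw_params_t *hw;
unsigned int rate = 44100, periods = 2;
snd_pcm_uframes_t period_size = 1764;
int dir = 0;

snd_pcm_hw_params_alloca(&hw);
snd_pcm_hw_params_any(handle, hw);
snd_pcm_hw_params_set_access(handle, hw, SND_PCM_ACCESS_RW_INTERLEAVED);
snd_pcm_hw_params_set_format(handle, hw, SND_PCM_FORMAT_S16_LE);
snd_pcm_hw_params_set_channels(handle, hw, 1);

/* the "near" setters adjust the requested value to what the hardware accepts */
snd_pcm_hw_params_set_rate_near(handle, hw, &rate, &dir);
snd_pcm_hw_params_set_periods_near(handle, hw, &periods, &dir);
snd_pcm_hw_params_set_period_size_near(handle, hw, &period_size, &dir);

snd_pcm_hw_params(handle, hw);      /* commit the configuration to the device */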
Here is the relevant part of the filling loop, which is called repeatedly:
log.debug("play %d samples", n);
while ((ret = snd_pcm_writei(handle, playBuf, n)) != (long)n) {
if (ret < 0) {
log.warn("ALSA error: %s\n", snd_strerror(ret));
if ((ret = snd_pcm_recover(handle, ret, 0)) < 0) {
log.error("ALSA error after recover: %s\n", snd_strerror(ret));
checkFatalAlsaError(snd_pcm_prepare(handle), "ALSA irrecoverable error: %s");
}
} else {
log.warn("ALSA short write...?\n");
break;
}
}
Here is the log when everything is fine:
3751.3029 D [AlsaSound] Starting square sound, nsamples: 3528, nPeriods: 2, nFrames: 1764
3751.3030 D [AlsaSound] play 1739 samples
3751.3037 D [AlsaSound] play 1739 samples
3751.3046 D [AlsaSound] play 50 samples
3751.3048 D [AlsaSound] Stop sound
And sometimes I get this:
3752.8764 D [AlsaSound] Setup square sound, time: 800, nsamples: 3528
3752.8769 D [AlsaSound] Starting square sound, nsamples: 3528, nPeriods: 2, nFrames: 1764
3752.8770 D [AlsaSound] play 1739 samples
3752.8779 D [AlsaSound] play 1739 samples
3752.8782 W [AlsaSound] ALSA error: Broken pipe
ALSA lib ../../../alsa-lib-1.1.4.1/src/pcm/pcm.c:8323:(snd_pcm_recover) underrun occurred
3752.8792 D [AlsaSound] play 50 samples
3752.8793 D [AlsaSound] Stop sound
From the log timestamps it is visible that the underrun occurs within 2 ms of the first write, which writes ~40 ms worth of samples. The two examples shown are otherwise identical; the device is not playing sound and has been prepare()'d.
What can be the problem, and the solution?
Please note that it is intentional that I write fewer samples than the period size.
Linux kernel version 4.9.87, libasound2 version 1.1.4.1-r0, ARM (Colibri iMX6) platform

Elixir/Erlang: How to find the source of high CPU usage?

My Elixir app is using about 50% of the CPU, but it really should only be using <1%. I'm trying to figure out what is causing the high CPU usage and I'm having some trouble.
In a remote console, I tried
Listing all processes with Process.list
Looking at the process info with Process.info
Sorting the processes by reduction count
Sorting the processes by message queue length
The message queues are all close to 0, but the reduction counts are very high for some processes. The processes with high reduction counts are named:
1. :file_server_2
2. ReactPhoenix.ReactIo.Pool
3. :code_server
(1) and (3) are both present in my other apps, so I feel like it must be (2). This is where I'm stuck. How can I go further and figure out why (2) is using so much CPU?
I know that ReactPhoenix uses react-stdio. Looking at top, react-stdio doesn't use any resources, but the beam does.
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 87 53.2 1.2 2822012 99212 ? Sl Nov20 580:03 /app/erts-9.1/bin/beam.smp -Bd -- -root /app -progname app/releases/0.0.1/hello.sh -- -home /root -- -noshell -noshell -noinput -boot /app/
root 13873 0.0 0.0 4460 792 ? Rs 13:54 0:00 /bin/sh -c deps/react_phoenix/node_modules/.bin/react-stdio
I saw in this StackOverflow post that stdin can cause resource issues, but I'm unsure if that applies here. Anyway, any help would be greatly appreciated!
Did you try etop?
iex(2)> :etop.start
========================================================================================
nonode#nohost 14:57:45
Load: cpu 0 Memory: total 26754 binary 143
procs 51 processes 8462 code 7201
runq 0 atom 292 ets 392
Pid Name or Initial Func Time Reds Memory MsgQ Current Function
----------------------------------------------------------------------------------------
<0.6.0> erl_prim_loader '-' 458002 109280 0 erl_prim_loader:loop
<0.38.0> code_server '-' 130576 196984 0 code_server:loop/1
<0.33.0> application_controll '-' 58731 831632 0 gen_server:loop/7
<0.88.0> etop_server '-' 58723 109472 0 etop:data_handler/2
<0.53.0> group:server/3 '-' 19364 2917928 0 group:server_loop/3
<0.61.0> disk_log:init/2 '-' 16246 318352 0 disk_log:loop/1
<0.46.0> file_server_2 '-' 3838 18752 0 gen_server:loop/7
<0.51.0> user_drv '-' 3720 13832 0 user_drv:server_loop
<0.0.0> init '-' 2559 34440 0 init:loop/1
<0.37.0> kernel_sup '-' 2093 58600 0 gen_server:loop/7
========================================================================================
http://erlang.org/doc/man/etop.html

BPF write fails with 1514 bytes

I'm unable to write 1514 bytes (including the L2 information) via write() to /dev/bpf. I can write smaller packets (so I think the basic setup is correct), but I get "Message too long" with the full-length packets. This is on Solaris 11.2.
It's as though the write is treating this as the write of an IP packet.
Per the specs, there are 1500 bytes for the IP portion, 14 for the L2 header (18 if tagging), and 4 bytes for the frame checksum.
I've set the feature that I thought would prevent the OS from adding its own layer 2 information (yes, I also find it odd that a 1 disables it; pseudo code below):
int hdr_complete = 1;
ioctl(bpf, BIOCSHDRCMPLT, &hdr_complete);
The packets are never larger than 1514 bytes (they're captured via a port span and start with the source and destination MAC addresses; I'm effectively replaying them).
I'm sure I'm missing something basic here, but I'm hitting a dead end. Any pointers would be much appreciated!
Partial Answer: This link was very helpful.
Update 3/20/2017
The code works on Mac OS X, but on Solaris it results in repeated "Interrupted system call" (EINTR) errors. I'm starting to read scary things about having to implement signal handling, which I'd rather not do...
Sample code on GitHub based on various code I've found via Google. On most systems you have to run this with root privileges unless you've granted "net_rawaccess" to the user.
Still trying to figure out the EINTR issue. Output from truss:
27158/1: 0.0122 0.0000 write(3, 0x08081DD0, 1514) Err#4 EINTR
27158/1: \0 >E1C09B92 4159E01C694\b\0 E\005DC82E1 #\0 #06F8 xC0A81C\fC0A8
27158/1: 1C eC8EF14 Q nB0BC 4 V #FBDE8010FFFF8313\0\00101\b\n ^F3 W # C E
27158/1: d SDD G14EDEB ~ t sCFADC6 qE3C3B7 ,D9D51D VB0DFB0\b96C4B8EC1C90
27158/1: 12F9D7 &E6C2A4 Z 6 t\bFCE5EBBF9C1798 r 4EF "139F +A9 cE3957F tA7
27158/1: x KCD _0E qB9 DE5C1 #CAACFF gC398D9F787FB\n & &B389\n H\t ~EF81
27158/1: C9BCE0D7 .9A1B13 [ [DE\b [ ECBF31EC3 z19CDA0 #81 ) JC9 2C8B9B491
27158/1: u94 iA3 .84B78AE09592 ;DA ] .F8 A811EE H Q o q9B 8A4 cF1 XF5 g
27158/1: EC ^\n1BE2C1A5C2 V 7FD 094 + (B5D3 :A31B8B128D ' J 18A <897FA3 u
EDIT 7 April 2017
The EINTR problem was the result of a bug in the sample code that I placed on GitHub. The code was not associating the BPF device with the actual interface, and Solaris returned EINTR as a result.
Now I'm back to the "message too long" problem that I still haven't resolved.
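For what it's worth, the "associate the device with an interface" step mentioned above corresponds to the standard BIOCSETIF ioctl. A minimal sketch of the open/bind/header-complete sequence; the interface name is just an example, and the /dev/bpf node name follows the question (on BSD systems the nodes are /dev/bpf0, /dev/bpf1, ...):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/bpf.h>
#include <net/if.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int open_bpf(const char *ifname)            /* e.g. "net0" */
{
    struct ifreq ifr;
    u_int hdr_complete = 1;
    int fd;

    fd = open("/dev/bpf", O_RDWR);
    if (fd < 0)
        return -1;

    /* Associate the BPF descriptor with a real interface;
     * without this there is nothing to send the frames on. */
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name) - 1);
    if (ioctl(fd, BIOCSETIF, &ifr) < 0)
        goto fail;

    /* We supply the complete L2 header ourselves. */
    if (ioctl(fd, BIOCSHDRCMPLT, &hdr_complete) < 0)
        goto fail;

    return fd;
fail:
    close(fd);
    return -1;
}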
