Why does this Armbian repeatedly lose connectivity?

I have an Olimex Lime2 running Armbian, headless. On this board I only care about SSH and MiniDLNA. I hope to get around to including the whole configuration, but one important bit might be that in /boot/armbianEnv.txt I put
extraargs=acpi=off
For a year now I have been experiencing availability issues that are very hard to debug. The machine randomly stops being accessible via ping or ssh. The issues are hard to debug because they seem to disappear as soon as a monitor or keyboard is connected, and I can't find any trace of them when the system runs headless. While I got the problem mostly under control without knowing how, the Olimex still stops responding now and then. This time I want to ask why.
I noticed that the Olimex stopped providing DLNA access on 10/25, around 2 pm. I did not touch it, to see whether it would recover (which sometimes happens). This time the system remained unreachable for two days until I unplugged the power.
Below you can find links to two logs. I would be very glad if anything suspicious in them could be pointed out, so I have somewhere to start.
One thing I wonder about in particular: why did the system decide to reboot? There was no power outage that day. I would expect a normal reboot to show up in the logs, wouldn't it?
The logs:
/var/log/messages: https://pastebin.com/qgRumreB
/var/log/syslog: https://pastebin.com/U5jpHNHm
The logs are otherwise complete: I only removed lines at the beginning and at the end, not in between.

Although I did not find a clear solution, I want to share what I have tried. Hopefully this will provide inspiration for others with similar problems.
One thing that helped me a lot (and wasn't available back then) was switching to a newer OS. I am now running Armbian based on Ubuntu 18.04, the logs are less cluttered, and some minor details improved as well. If you landed here and still run Armbian Stretch, you should upgrade.
One reason for frequent crashes could be a corrupt file system, itself caused by frequent crashes :-(. Armbian mounts your SD card with
UUID=<uid> / ext4 defaults,noatime,nodiratime,commit=600,errors=remount-ro 0 1
Note the commit=600, which means that changes are flushed to disk only every 10 minutes. If your machine crashes in between, the file system might become corrupt. You should therefore run fsck.ext4 on your SD card's file system. To counteract the problem in the future, you could (see the sketch after this list):
run fsck at every boot
omit the commit setting. My SD card can handle the additional write load quite well.
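For example, a minimal sketch of both options (the device name is a placeholder for your SD card's root partition):
# have e2fsck check the file system on every boot by setting the maximum mount count to 1
sudo tune2fs -c 1 /dev/mmcblk0p1
# and/or drop commit=600 from the root entry in /etc/fstab:
UUID=<uid> / ext4 defaults,noatime,nodiratime,errors=remount-ro 0 1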
What I think solved the problem for me was putting my external HDD, attached via SATA, to sleep. I was surprised that this does not happen automatically. I now have the following section appended to my /etc/hdparm.conf:
/dev/disk/by-uuid/<uid> {
spindown_time = 60
write_cache = off
}
This tells hdparm to put the HDD into standby after 5 minutes of inactivity (the spindown_time value is in units of 5 seconds). Turning the write cache off is another safety measure against file system corruption: some disks reorder Btrfs write operations when they shouldn't.
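To verify that the drive really spins down, you can query its power state after it has been idle for a while; a quick check (assuming /dev/sda is the external disk):
sudo hdparm -C /dev/sda
# expected once the disk has spun down:
#  drive state is:  standby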
Something I observed is that putting some load on the machine helped keep the system alive. Unfortunately, that was only true up to a certain point. Still, if attaching a keyboard and mouse, or keeping a script running, helps against the restarts, you will have something to work with.
I use the following script to log information that might help me in case of a crash:
#!/bin/bash
# LICENSE: GPLv3 or later
set -euo pipefail

# Default log file; can be overridden by passing a path as the only argument.
LOGFILE=/home/mgoerner/error-detection.log

function main() {
    parse_cli_args "$@"
    while true
    do
        print_debug_information >>"$LOGFILE" 2>&1
        sync
        sleep 3m
    done
}

function parse_cli_args() {
    if [[ $# -eq 1 ]]
    then
        arg="$1"; shift
        if [[ "$arg" == "--help" || "$arg" == "-h" ]]
        then
            print_usage
            exit
        fi
        LOGFILE="$arg"
    elif [[ $# -gt 1 ]]
    then
        echo "Please provide at most one argument!" >&2
        exit 1
    fi
}

function print_usage() {
    cat <<EOF
$0 [LOGFILE]
EOF
}

function print_debug_information() {
    echo
    date
    uptime
    dmesg -uT | tail
    # Adjust the interface name to match your WiFi dongle.
    ip addr show wlxd85d4c97e434
    iwlist wlxd85d4c97e434 scan | egrep ESSID || true   # an empty scan result should not abort the loop (set -e/pipefail)
    hdparm -acdgkmurC /dev/sda
    free
}

main "$@"
I let it start automatically at boot. Setting the sleep time below roughly 10 minutes used to make the crashes disappear, but it no longer does. Unfortunately, the error log produced by this script never gave me any insight, and the same goes for the various logs in /var/log/. That might be different for you, though.
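One way to start such a script automatically at boot (a sketch using cron's @reboot; the script path is only an example):
# crontab -e
@reboot /home/mgoerner/error-detection.sh /home/mgoerner/error-detection.log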
Furthermore, I suspect that my WiFi dongle does not like running warm. I reuse a child's shoe box as a case, and putting the dongle inside the closed box caused some connectivity problems.
Last but not least, I turned off automatic updates. Very often the crashes happened directly after some (mysterious) reboot. Turning off automatic updates got rid of this altogether.
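On Ubuntu-based Armbian, automatic updates usually come from unattended-upgrades; a sketch of how to switch them off, assuming that stock mechanism:
sudo systemctl disable --now unattended-upgrades
# or set the following in /etc/apt/apt.conf.d/20auto-upgrades:
APT::Periodic::Unattended-Upgrade "0";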

Related

Automatically connect to a CGI process and break in GDB before it exits?

For a C application accessed via CGI-BIN, the documentation available online for attaching to the process and breaking in GDB relies on manipulating the source code (e.g. adding an infinite loop) so that the process stays alive long enough for a developer to attach, exit the loop, and debug.
Is it feasible that a tool could monitor the process list, and attach via GDB, immediately breaking in order for a developer to achieve this without requiring source code changes?
The rough structure of what I have in mind to develop is something along the lines of:
1. My process monitors the process list on the system.
2. A process matching the name of my application, and owned by Apache, appears in the list.
3. My process immediately performs a 'pgrep' and a 'gdb -p' command, then sends a breakpoint command to pause the process.
4. The developer can then access the process and look at the flow of execution.
Is this feasible as an idea, or is it impossible due to some constraint (e.g. a race condition that cannot always be fulfilled)?
Is this feasible
Sure: a trivial shell script will do:
while true; do
    PID=$(pgrep my_app)
    if [[ -n "$PID" ]]; then
        gdb -p "$PID"
    fi
done
a race condition
The problem is that between pgrep and gdb -p the application may make significant progress, or even run to completion.
The only way to avoid that is to intercept all execve system calls on the system, as Tom Tromey's preattach.stp does.
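A way to narrow that window without preattach.stp (a sketch, not from the answer above; my_app and the apache owner come from the question): freeze the process with SIGSTOP as soon as it appears, then attach at leisure.
while true; do
    PID=$(pgrep -n -u apache my_app)   # newest process matching name and owner
    if [[ -n "$PID" ]]; then
        kill -STOP "$PID"              # stop it so it cannot run to completion
        gdb -p "$PID"                  # attach and debug; 'continue' resumes execution
        break
    fi
    sleep 0.1
done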

How to display the top largest files in a non-blocking manner on Linux?

For years I have been using variations of the du command below to produce a report of the largest files in a specific location, and most of the time it worked well.
du -L -ch /var/log | sort -rh | head -n 10 &> log-size.txt
However, this proved to get stuck in several cases, in a way that could not be stopped even with the timeout -s KILL 5m ... approach.
A few years back this was caused by stalled NFS mounts, but more recently I have run into it on VMs where I didn't use NFS at all. Apparently there is a roughly 1-in-30 chance of hitting this on OpenStack builds.
I read that following symbolic links (-L) can block du in some cases if there are loops, but my tests failed to reproduce the problem, even when I created a loop.
I cannot avoid following the symlinks because that's how the files are organized.
What would be a safer alternative for generating this report, one that would not block, or that at least can be constrained to a maximum running duration if it does? It is essential to limit the execution of this command to a few minutes -- and if I can also get a partial result on timeouts, or some debugging info, even better.
If you don't care about sparse files and can make do with apparent size (and not the on-disk size), then ls should work just fine: ls -L --sort=s | head -n 10 > log-size.txt
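Another sketch (not part of the answer above): have GNU find print apparent sizes, and bound only the find with timeout, so that sort and head still emit a partial result if the directory walk hangs:
timeout -k 10s 5m find -L /var/log -type f -printf '%s %p\n' 2>/dev/null \
    | sort -rn | head -n 10 > log-size.txt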

How can I set file creation times in ZFS?

I've just got a NAS running ZFS and I'd like to preserve creation times when transferring files to it. Both Linux/ext4 (where the data is now) and ZFS store a creation time or birth time; in the case of ZFS it's even reported by the stat command. But I haven't been able to figure out how to set the creation time of a file so that it mirrors the creation time on the original file system, unlike an ext4-to-ext4 transfer, where I can feed debugfs a script to set the file creation times.
Is there a tool similar to debugfs for ZFS?
PS. To explain better:
I have a USB drive attached to a Ubuntu 14.04 laptop. It holds a file system where I care about the creation date (birth date) of the individual files. I consult these creation timestamps often using a script based on debugfs, which reports it as crtime.
I want to move the data to a NAS box running ZFS, but the methods I know (scp -p -r, rsync -a, and tar, among others I've tried) preserve the modification time but not the creation time.
If I were moving to another ext4 file system I would solve the problem using the fantastic tool debugfs. Specifically I can make a list of (filename, crtime) pairs on the source fs (file system), then use debugfs -w on the target fs to read a script with lines of the form
set_inode_field filename crtime <value>
I've tested this and it works just fine.
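For reference, a rough sketch of that ext4-to-ext4 workflow (device names, the file path, and the epoch value are placeholders; the @<seconds-since-epoch> notation for time fields is assumed here):
# read the creation time on the source file system
debugfs -R 'stat /photos/img001.jpg' /dev/sdb1 | grep crtime
# set it on the target file system
debugfs -w -R 'set_inode_field /photos/img001.jpg crtime @1430994000' /dev/sdc1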
But my target fs is not ext4 but ZFS and although debugfs runs on the target machine, it is entirely useless there. It doesn't even recognize the fs. Another debug tool that lets you alter timestamps by editing an inode directly is fsdb; it too runs on the target machine, but again I can't seem to get it to recognize a ZFS file system.
I'm told by the folks who sold me the NAS box that debugfs and fsdb are not meant for ZFS filesystems, but they haven't been able to come up with an equivalent. So, after much googling and trying out things I finally decided to post a question here today, hoping someone might have the answer.
I'm surprised at how hard this is turning out to be. The question of how to replicate a dataset so all timestamps are identical seems quite natural from an archival point of view.
Indeed, neither fsdb nor debugfs is likely to be suitable for use with ZFS. What you might need to do instead is find an archive format that will preserve the crtime field that is presumably already set for the files on your file server. If there is a version of pax or another archiving tool for your system, it may be able to do this (cf. the -pe "preserve everything" flag for pax, which in current versions apparently does not preserve "everything", viz. it does not preserve crtime/birth_time). You will likely have more success finding an archiving application that is "crtime aware" than trying to set the creation times by hacking on the ZFS-based FreeBSD system with what are likely to be rudimentary tools.
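For what it's worth, a pax invocation of that kind would look roughly like the following sketch (paths are placeholders; per the above, -pe apparently still does not carry crtime over):
# copy a tree in read/write mode, asking pax to preserve "everything"
pax -rw -pe /mnt/usb/photos /tank/photos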
You may be able to find more advanced tools on OpenSolaris-based systems like Illumos or SmartOS (e.g. mdb). Whether it would be possible to transfer your data to a ZFS dataset on one of those platforms and then, by combining the tools they have with, say, dtrace, rewrite the crtime fields is more of a theoretical question. If it worked, you could export the pool and its datasets to FreeBSD; exporting a pool does seem to preserve the crtime time stamps. If you are able to preserve crtime while dumping your ext4 file system to a ZFS on Linux dataset on the same host (nb: I have not tested this), you could then use zfs send to transfer the whole file system to your NAS.
This coreutils bug report may shed some light on the state of user- and operating-system-level tools on Linux. Arguably, the file-system-level crtime field of an inode should be difficult to change. While ZFS on FreeBSD "supports" crtime, the state of low-level file system debugging tools on FreeBSD might not have kept pace in earlier releases (cf. the zdb manual page). Are you sure you want to "set" (or reset) inode creation times? Or do you want to preserve them after they have been set on a system that already supports them?
On a FreeBSD system if you stat a file stored on a ZFS dataset you will often notice that the crtime field of the file is set to the same time as the ctime field. This is likely because the application that wrote the file did not have access to library and kernel functions required to set crtime at the time the file was "born" and its inode entries were created. There are examples of applications / libraries that try to preserve crtime at the application level such as libarchive(3) (see also: archive_entry_atime(3)) and gracefully handle inode creation if the archive is restored on a filesystem that does not support the crtime field. But that might not be relevant in your case.
As you might imagine, there are a lot of applications that write files to filesystems ... especially with Unix/POSIX systems where "everything is a file". I'm not sure if older applications would need to be modified or recompiled to support those fields, or whether they would pick them up transparently from the host system's C libraries. Applications being used on older FreeBSD releases or on a Linux system without ext4 could be made to run in compatibility mode on an up to date OS, for example, but whether they would properly handle the time fields is a good question.
For me running this little script as sh birthtime_test confirms that file creation times are "turned on" on my FreeBSD systems (all of which use ZFS post v28 i.e. with feature flags):
#!/bin/sh
#birthtime_test
uname -r
if [ -f new_born ] ; then rm -f new_born ; fi
touch new_born
sleep 3
touch -a new_born
sleep 3
echo "Hello from new_born at:" >> new_born
echo `date` >> new_born
sleep 3
chmod o+w new_born
stat -f "Name:%t%N
Born:%t%SB
Access:%t%Sa
Modify:%t%Sm
Change:%t%Sc" new_born
cat new_born
Output:
9.2-RELEASE-p10
Name: new_born
Born: May 7 12:38:35 2015
Access: May 7 12:38:38 2015
Modify: May 7 12:38:41 2015
Change: May 7 12:38:44 2015
Hello from new_born at:
Thu May 7 12:38:41 EDT 2015
(NB: the chmod operation "changes" but does not "modify" the file contents - this is what the echo command does by adding content to the file. See the touch manual page for explanations of the -m and -a flags).
This is the oldest FreeBSD release I have access to right now. I'd be curious to know how far back in the release cycle FreeBSD is able to handle this (on ZFS or UFS2 file systems). I'm pretty sure it has been a feature for quite a while now. There are also OS X and Linux versions of ZFS for which it would be useful to know how they handle this feature.
Just one more thing ...
Here is an especially nice feature for simple "forensics". Say we want to send our new_born file back to when time began, back to the leap second that never happened and when, in a moment of timeless time, Unix was born ... :-) [1] We can just change the date using touch -d and everyone will think new_born is old and wise, right?
Nope:
~/ % touch -d "1970-01-01T00:00:01" new_born
~/ % stat -f "Name:%t%N
Born:%t%SB
Access:%t%Sa
Modify:%t%Sm
Change:%t%Sc" new_born
Name: new_born
Born: May 7 12:38:35 2015
Access: Jan 1 00:00:01 1970
Modify: Jan 1 00:00:01 1970
Change: May 7 13:29:37 2015
It's always more truthful to actually be as young as you look :-)
Time and Unix - a subject both practical and poetic: after all, what is "change"; and what does it mean to "modify" or "create" something? Thanks for your great post Silvio - I hope it lives on and gathers useful answers.
You can improve and generalize your question if you can be more specific about your requirements for preserving, setting, and archiving file timestamp fields. Don't get me wrong: this is a very good question and it will continue to get upvotes for a long time.
You might take a look at Dylan Leigh's presentation Forensic Timestamp Analysis of ZFS, or even contact Dylan for clues on how to access crtime information.
[1] There was a legend that claimed in the beginning, seconds since long (SSL) ago was never less than date -u -j -f "%Y-%m-%d:%T" "1970-01-01:00:00:01" "+%s" because of a leap second ...

How to identify gnome-terminal profile?

I posted a question on askubuntu a while back. As there has been no action, and I have also dug some more, I'll try here. It is possibly a more correct place (and I don't know if it is possible to move questions any more; I do not get any of those options).
Anyhow:
Is there a way to get the gnome-terminal profile ID? I need it in a bash script, to do e.g.:
gconftool-2 "do some change to some value for current profile."
In my search for an answer I have made some progress, but found no satisfying solution. To be honest, it truly scares me how shielded the application is against modifications from the command line, given that it is a terminal emulator! To me it is incomprehensible.
Besides touching the source of gnome-terminal (I do not want a custom version), is there some legitimate way to get this? Given that it is a wrapper around VTE and uses various shared libraries, perhaps there is some way I haven't thought of.
Adding some C code into the mix is OK.
So far:
I have checked out the --save-config option, but as it is (1) not satisfactory, i.e. not 100% reliable, and (2) more importantly, also going to be removed, it fails completely. See my own answer below for more detail.
There is no environment variable for this.
dbus: there don't seem to be any messages transmitted, or any functions available, for this kind of information. I have tested both the current version (3.6.0) and the latest development version.
injection: though it is probably possible, and I have played around with injecting custom code into it, it is such an error-prone endeavour that it is not a solution.
In case anyone wonders, etc.
Decided to have another look at this - and made a little progress.
Using the built-in option --save-config, these are the properties of interest:
Role=gnome-terminal-window-2587-1856448950-1359348087
ActiveTerminal=Terminal0xa896200
Geometry=110x87+900+1
WorkingDirectory=/home/xxx/tmp
Looking at it closer: I opened two windows in short succession and did a save-config.
Role
We can split it into its parts:
gnome-terminal-window
2587
1856448950
1359348087
PID
2587 is the same for both, and after a quick pstree 2587 -p we find it to be the PID. Further, an echo $$ locates our bash (or whichever shell one prefers).
Time of start
Now the second number is wildly different, a clue that it is probably a random value. The last one, though, differs only in the last digit; most probably a timestamp. I know I'm in the tmp directory for this window, so, using our knowledge of the proc file system:
# btime: boot time, in seconds since the Epoch
$ cat /proc/stat | grep ^btime | cut -d' ' -f2
1359039155
# starttime: The time in jiffies the process started after system boot.
$ cat /proc/$$/stat | cut -d' ' -f22
30893222
# WANT: 1359348087
btime + (starttime / Hertz)
1359039155 + (30893222 / 100) = 1359348087.22 ~ 1359348087
OK. The last number is the start time in seconds since the Epoch. Unfortunately it is not in jiffies but a rounded value, so if we start several windows from e.g. a script we can end up with the same value.
(After some checking, it also seems the seconds are rounded to nearest rather than towards zero.)
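Putting that together, a small sketch that computes a process's start time in seconds since the Epoch (shown for the current shell; substitute any PID):
btime=$(awk '/^btime/ {print $2}' /proc/stat)    # boot time, seconds since the Epoch
starttime=$(awk '{print $22}' /proc/$$/stat)     # process start, in clock ticks after boot
hz=$(getconf CLK_TCK)                            # ticks per second (usually 100)
echo $(( btime + starttime / hz ))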
Random value
OK. So what about the value after the PID? Most probably a random value, but to be sure we have to go to the source.
$ git clone git://git.gnome.org/gnome-terminal
$ gnome-terminal --version
GNOME Terminal 3.6.0
$ git log --grep="3\.6\.0"
commit f4d291a90dc4f513fc15f80fdebcdc3c3349b70a
...
Version 3.6.0
$ git checkout f4d291a90dc4f513fc15f80fdebcdc3c3349b70a
After some digging we find:
/* terminal-util.c, line 48 */
void
terminal_util_set_unique_role (GtkWindow *window, const char *prefix)
{
    char *role;

    role = g_strdup_printf ("%s-%d-%d-%d",
                            prefix,
                            getpid (),
                            g_random_int (),
                            (int) time (NULL));
    gtk_window_set_role (window, role);
    g_free (role);
}
OK. Not only do we confirm that the second value is random, but also that the PID and timestamp interpretations are correct.
Geometry
xwininfo -id $(xdotool getactivewindow) | \
grep '^\s*-geometry' | \
sed 's/^\s*[^ ]* \(.*\)/\1/'
# yields 110x87+900+1
OK. Now we have three values to check against:
Time
Geometry
Path
The problem is that even with this we can easily have two windows where all of those values are identical. More importantly, some genius has decided to remove this option from the application.
Terminal Window hex
Looking further at the code, one finds that the hex value in ActiveTerminal etc. is a pointer to the current in-memory address of the struct holding the current window. In other words, not very useful unless one wants to hack memory mappings.

How to detect pending system shutdown on Linux?

I am working on an application where I need to detect a system shutdown.
However, I have not found any reliable way to get notified of this event.
I know that on shutdown my app will receive a SIGTERM signal followed by a SIGKILL. I want to know whether there is any way to query if a SIGTERM is part of a shutdown sequence.
Does any one know if there is a way to query that programmatically (C API)?
As far as I know, the system does not provide any other method to query for an impending shutdown. If it does, that would solve my problem as well. I have been trying out runlevels too, but changes in runlevel seem to be instantaneous and come without any prior warning.
Maybe a little bit late, but yes, you can determine whether a SIGTERM is part of a shutdown by invoking the runlevel command. Example:
#!/bin/bash
trap "runlevel >$HOME/run-level; exit 1" term
read line
echo "Input: $line"
Save it as, say, term.sh and run it. After executing killall term.sh, you should be able to see and investigate the run-level file in your home directory. Then execute any of the following:
sudo reboot
sudo halt -p
sudo shutdown -P
and compare the differences in the file. Then you should have an idea of how to do it.
There is no way to determine if a SIGTERM is part of a shutdown sequence. To detect a shutdown sequence you can either use rc.d scripts, as ereOn and Eric Sepanson suggested, or use mechanisms like DBus.
However, from a design point of view it makes no sense to ignore SIGTERM even if it is not part of a shutdown. SIGTERM's primary purpose is to politely ask apps to exit cleanly and it is not likely that someone with enough privileges will issue a SIGTERM if he/she does not want the app to exit.
From man shutdown:
If the time argument is used, 5 minutes before the system goes down
the /etc/nologin file is created to ensure that further logins shall
not be allowed.
So you can test for the existence of /etc/nologin. It is not optimal, but it is probably the best you can get.
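A minimal check along those lines (sketch):
if [ -e /etc/nologin ]; then
    echo "a timed shutdown appears to be pending"
fi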
It's a little bit of a hack, but if the server is running systemd you can run
/bin/systemctl list-jobs shutdown.target
... it will report ...
JOB UNIT TYPE STATE
755 shutdown.target start waiting <---- existence means shutting down
1 jobs listed.
... if the server is shutting down or rebooting (hint: there's a reboot.target if you want to look specifically for that).
You will get "No jobs running." if it is not being shut down.
You have to parse the output, which is a bit messy, as systemctl doesn't return a different exit code for the two results. But it does seem reasonably reliable. You will need to watch out for a format change in the messages if you update the system, however.
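A sketch of how that could be wrapped in a script (the grep pattern follows the output shown above and may need adjusting after a systemd update):
if /bin/systemctl list-jobs shutdown.target 2>/dev/null | grep -q 'shutdown\.target'; then
    echo "system is shutting down"
else
    echo "system is not shutting down"
fi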
Making your application respond differently to some SIGTERM signals than to others seems opaque and potentially confusing. It's arguable that you should always respond the same way to a given signal. Adding unusual conditions makes it harder to understand and test application behavior.
Adding an rc script that handles shutdown (by sending a special signal) is a completely standard way to handle such a problem; if this script is installed as part of a standard package (make install or rpm/deb packaging) there should be no worries about control of user machines.
I think I got it.
Source =
https://github.com/mozilla-b2g/busybox/blob/master/miscutils/runlevel.c
I copy part of the code here, just in case the reference disappears.
#include "libbb.h"
...
struct utmp *ut;
char prev;

if (argv[1]) utmpname(argv[1]);
setutent();
while ((ut = getutent()) != NULL) {
    if (ut->ut_type == RUN_LVL) {
        prev = ut->ut_pid / 256;
        if (prev == 0) prev = 'N';
        printf("Runlevel: prev=%c current=%c\n", prev, ut->ut_pid % 256);
        endutent();
        return 0;
    }
}
puts("unknown");
See man systemctl; you can determine whether the system is shutting down like this:
if [ "$(systemctl is-system-running)" = "stopping" ]; then
    # Do what you need
fi
This is in bash, but you can do it with system() in C.
The practical answer to do what you originally wanted is that you check for the shutdown process (e.g. ps aux | grep "shutdown -h") and then, if you want to be sure, check its command-line arguments and the time it was started (e.g. "shutdown -h +240" started at 14:51 will shut down at 18:51).
In the general case, from the point of view of the entire system, there is no way to do this. There are many different ways a "shutdown" can happen. For example, someone can decide to pull the plug in order to hard-stop a program that they know has bad or dangerous behaviour at shutdown time, or a UPS could first send a SIGHUP and then simply fail. Since such a shutdown can happen suddenly and with no warning anywhere in the system, there is no way to be sure that it's okay to keep running after a SIGHUP.
If a process receives SIGHUP, you should basically assume that something nastier will follow soon. If you want to do something special and partially ignore SIGHUP, then (a) you need to coordinate that with whatever program will do the shutdown, and (b) you need to be prepared so that, if some other system does the shutdown and kills you dead soon after a SIGHUP, your software and data will survive. Write out any data you have and only continue writing to append-only files with safe atomic updates.
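For the "safe atomic updates" part, the usual pattern is to write to a temporary file and rename it into place (a sketch; file names are examples):
tmp=$(mktemp state.XXXXXX)                 # temporary file on the same filesystem as the target
printf '%s\n' "$important_state" > "$tmp"
sync                                       # push the data to disk
mv "$tmp" state.txt                        # rename is atomic: readers see the old or the new file, never half of it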
For your case I'm almost sure your current solution (treat all SIGHUPs as a shutdown) is the correct way to go. If you want to improve things, you should probably add a feature to the shutdown program that sends a notification via D-Bus or something similar.
When the system shuts down, the rc.d scripts are called.
Maybe you can add a script there that sends some special signal to your program.
However, I doubt you can stop the system shutdown that way.
