Nagios max_check_attempts/retry interval ignored with host checks - nagios

I'm running Nagios 3.2.3 and I have a mysterious issue with host checks. Here's an example host definition.
define host {
host_name HOST
contacts CONTACTS_HERE
alias ALIAS
max_check_attempts 15
check_interval 5
active_checks_enabled 1
passive_checks_enabled 1
check_period 24x7
obsess_over_host 0
retry_interval 1
check_freshness 0
freshness_threshold 120
retain_status_information 1
retain_nonstatus_information 1
low_flap_threshold 0
high_flap_threshold 0
flap_detection_enabled 0
process_perf_data 1
notification_interval 120
notification_period 24x7
notification_options d,u,r
check_command check-host-alive
icon_image_alt Linux
icon_image linux40.png
statusmap_image linux40.gd2
}
As you can see, max_check_attempts is set to 15 and retry_interval is set to 1 minute. The check command looks like this:
define command {
command_name check-host-alive
command_line /usr/lib64/nagios/plugins/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 1
}
What happens, however, is this sequence of events (newest entry first):
Host Up[01-30-2017 21:41:56] HOST ALERT: HOST_NAME;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.17 ms
Host Down[01-30-2017 21:41:21] HOST ALERT: HOST_NAME;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Host Down[01-30-2017 21:41:10] HOST ALERT: HOST_NAME;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
So, after the first check failure, the host goes into a HARD state instead of the check being retried 15 times, 1 minute apart. I should add that this seems to happen when the host is not actually down but extremely busy.
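My understanding of the documented soft/hard logic is roughly this (a simplified sketch of how max_check_attempts is supposed to count, not Nagios source):

```shell
# Simplified sketch: a failing host should stay SOFT until it has failed
# max_check_attempts times in a row, and only then become HARD.
max_check_attempts=15
attempt=1
state_type=SOFT
while [ "$attempt" -lt "$max_check_attempts" ]; do
  # pretend every retry (scheduled retry_interval minutes apart) still fails
  attempt=$((attempt + 1))
done
# only after all retries fail should the state become HARD
state_type=HARD
echo "$state_type after $attempt attempts"
```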
Any ideas?
Thanks,
Sergei

Related

Can I just run one port to send packets continuously in Pktgen?

I want to send packets continuously on port "0", and I have done some configuration:
./app/x86_64-native-linuxapp-gcc/pktgen -l 0-2 -n 4 --proc-type auto --socket-mem 1024 -b 00:08.0 -- -P -m "[1].0"
and in the interactive CLI I set the following:
Pktgen>set 0 src ip "192.168.12.2/24"
Pktgen>set 0 dst ip "192.168.12.3"
Pktgen>set 0 proto udp
Pktgen>set 0 count 0
Pktgen>set 0 rate 50
Pktgen>set 0 size 64
Pktgen>start 0
But according to the main page display, port 0 transmits only a few packets and then stops sending; even if I stop 0 and start 0 again, there is no response.
Does Pktgen require two DPDK ports to be configured? When I configure two DPDK ports and run scripts/rfc2544_tput_test.lua, it works well. I want to know why...
Yes, you can. It works seamlessly. The only requirement is that the other side must be able to detect the link state change.

Cannot sync with the NTP server

I am using Lubuntu Linux 18.04 Bionic. When I run ntpq -pn, I cannot see that my computer is synced with my desired NTP server.
I have tried several tutorials like here: LINK. I took the NTP servers from Google HERE and included all 4 servers in my config file.
Then, I did the following things in order to sync with one of the Google NTP servers:
sudo service ntp stop
sudo ntpdate time1.google.com, which logged ntpdate[2671]: adjust time server 216.239.35.0 offset -0.000330 sec
sudo service ntp start
Here is my /etc/ntp.conf file:
driftfile /var/lib/ntp/ntp.drift
leapfile /usr/share/zoneinfo/leap-seconds.list
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict 127.0.0.1
restrict ::1
restrict source notrap nomodify noquery
server time1.google.com iburst
server time2.google.com iburst
server time3.google.com iburst
server time4.google.com iburst
After doing the steps above, I got this result from ntpq -pn:
remote refid st t when poll reach delay offset jitter
+216.239.35.0 .GOOG. 1 u 33 64 1 36.992 0.519 0.550
+216.239.35.4 .GOOG. 1 u 32 64 1 20.692 0.688 0.612
*216.239.35.8 .GOOG. 1 u 36 64 1 22.233 0.088 1.091
-216.239.35.12 .GOOG. 1 u 32 64 1 33.480 -0.218 1.378
Why is my computer not synced?
EDIT:
Here is my log output after sudo systemctl status ntp.service:
ntp.service - Network Time Service
Loaded: loaded (/lib/systemd/system/ntp.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-01-17 11:37:33 CET; 17min ago
Docs: man:ntpd(8)
Process: 2704 ExecStart=/usr/lib/ntp/ntp-systemd-wrapper (code=exited, status=0/SUCCESS)
Main PID: 2712 (ntpd)
CGroup: /system.slice/ntp.service
└─2712 /usr/sbin/ntpd -p /var/run/ntpd.pid -g -u 105:108
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: proto: precision = 1.750 usec (-19)
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): good hash
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: leapsecond file ('/usr/share/zoneinfo/leap-seconds.list'): loaded, e
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen and drop on 0 v6wildcard [::]:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen and drop on 1 v4wildcard 0.0.0.0:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen normally on 2 lo 127.0.0.1:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen normally on 3 wlan0 192.168.86.26:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen normally on 4 lo [::1]:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listen normally on 5 wlan0 [fe80::71d6:ec6e:fa92:b53%4]:123
Jan 17 11:37:33 ELAR-Systems ntpd[2712]: Listening on routing socket on fd #22 for interface updates
Your system time actually is getting synced, but it drifts off again very quickly.
The Raspberry Pi, Arduino, Asus Tinker and other single-board computers have no onboard RTC (real-time clock) and no battery to keep one running constantly. It has nothing to do with RAM or power; there is simply no hardware clock on the board.
On my Raspberry Pi, the time drifted off by several minutes within an hour.
The "software clock" on such a computer is affected by system load and is very unstable.
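As an aside, the ntpq -pn output in the question actually shows a sync: the first character of each line (the "tally code") is significant, and * marks the peer the daemon is currently synchronized to, + a candidate, - an outlier. A small sketch that extracts the system peer from a saved copy of that output:

```shell
# Parse a saved copy of the `ntpq -pn` output; the line whose tally code
# is "*" is the peer ntpd is actually synced to.
cat > /tmp/ntpq.out <<'EOF'
+216.239.35.0   .GOOG.  1 u  33  64  1  36.992  0.519  0.550
+216.239.35.4   .GOOG.  1 u  32  64  1  20.692  0.688  0.612
*216.239.35.8   .GOOG.  1 u  36  64  1  22.233  0.088  1.091
-216.239.35.12  .GOOG.  1 u  32  64  1  33.480 -0.218  1.378
EOF
synced=$(awk '/^\*/ { print substr($1, 2) }' /tmp/ntpq.out)
echo "synced to: $synced"
```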
An RTC extension board (for the Raspberry Pi) is available. (Image source: www.robotshop.com)

Nagios with Munin services gives unknown

On an up-to-date Debian Wheezy server, I'm using the backports of the following packages:
Nagios : nagios3 (3.4.1-5~bpo7+1)
Munin : munin (2.0.25-1~bpo70+1)
And nsca (2.9.1-2) to transmit data from Munin to Nagios in order to process alerts.
Nagios is working fine with the following configured Munin services :
# generic service template definition
define service{
name generic-munin-service ; The 'name' of this service template
use generic-service
check_command return-unknown!"No Data from passive check"
active_checks_enabled 0 ; Active service checks are disabled
passive_checks_enabled 1
parallelize_check 1
notifications_enabled 1
event_handler_enabled 1
is_volatile 1
notification_interval 120
notification_period 24x7
notification_options w,u,c,r
check_freshness 1
freshness_threshold 360
flap_detection_options n
max_check_attempts 2
register 0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
;first_notification_delay 6 ; Delay first notification for false positives (will execute 2 checks : munin sends 1 check every 5 minutes)
}
define service {
hostgroup_name munin
service_description Disk latency per device :: Average latency for /dev/sda
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
service_description Disk latency per device :: Average latency for /dev/sdb
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
hostgroup_name munin
service_description Disk usage in percent
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
hostgroup_name munin
service_description Inode usage in percent
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
hostgroup_name munin
service_description File table usage
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
But when I add further services, which are also available on all monitored hosts, they are labelled UNKNOWN in Nagios:
define service {
hostgroup_name munin
service_description Memory usage
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
define service {
hostgroup_name munin
service_description CPU usage
use generic-munin-service
notification_interval 0 ; set > 0 if you want to be renotified
}
I've already found out that, depending on the Munin plugin graph title format, Nagios may not understand the incoming data. That's why I've updated the packages on the server to the Wheezy backports versions, since Munin 2.0.7 should clean all titles.
I also tried to debug with a higher debug level, and the log shows :
[1434122043] SERVICE ALERT: HostIJZI4;Memory usage;UNKNOWN;HARD;2;INCONNU
But I may need your help to get further.
I suggest you update your packages; Nagios Core is currently at 4.1.1 and you are using an older version.
A lot of things have been fixed, so your issue may already be patched: https://www.nagios.org/projects/nagios-core/history/4x/
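Independently of upgrading, one way to narrow down whether the service_description Munin sends matches what Nagios expects is to hand-craft a passive result and submit it through the external command file yourself (PROCESS_SERVICE_CHECK_RESULT is a standard Nagios external command; the command-file path in the comment is the Debian nagios3 default and may differ on your system):

```shell
# Build a passive service check result line; return code 0 = OK.
now=$(date +%s)
cmd=$(printf '[%s] PROCESS_SERVICE_CHECK_RESULT;HostIJZI4;Memory usage;0;OK - manual test' "$now")
echo "$cmd"
# To actually submit it (Debian default path, adjust if needed):
#   echo "$cmd" > /var/lib/nagios3/rw/nagios.cmd
```

If a hand-submitted result flips the service to OK but Munin's results still come through as UNKNOWN, the mismatch is in the names Munin sends, not in the Nagios side.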

setting up passive checks on nagios

Hello board, this question may be a little green, however:
I've been trying to set up Nagios NSCA for passive checks on a local Ubuntu box as a prototype.
For those in the know: my nsca daemon is listening on 5667 and send_nsca runs on the same Ubuntu computer (localhost, 127.0.0.1). I've been reading about and testing object definitions and service templates, but I keep getting config errors when I try to access the Nagios web interface after modifications.
I hope to get clearer instructions on how to create the services (directories/configurations) to process passive checks in Nagios 3 on Ubuntu.
There are a few things to consider: firstly, that localhost is defined as a host; and secondly, that the check actually exists as it would for any other check, but with a command that doesn't actually do anything. For example, I've created a passiveservices.cfg file with services defined as follows:
define service{
use generic-service,service-pnp
host_name Server1,Server2
service_description Uptime
active_checks_enabled 1
passive_checks_enabled 1
check_command check_null
check_freshness 1
check_period none
}
define service{
use generic-service,service-pnp
host_name Server1,Server2
service_description Drive space
active_checks_enabled 1
passive_checks_enabled 1
check_command check_null
check_freshness 1
check_period none
}
Note that the check command is check_null; it's not actually doing anything, and passive_checks_enabled is 1.
There are two lines within Nagios.cfg which you need to enable:
accept_passive_host_checks
accept_passive_service_checks
It's also a good idea to enable the following two lines as well:
check_service_freshness
check_host_freshness
If a server doesn't report in within the set freshness threshold, Nagios will trigger the stale-check command (I trigger an email within my config).
Lastly, enable the following two lines:
log_external_commands
log_passive_checks
They'll help with debugging if this doesn't work. Nagios writes out to /var/log/syslog on Ubuntu (well, it does on mine).
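Once the services exist, the client side submits results with send_nsca, which reads tab-delimited lines of host name, service description, return code, and plugin output (the host names and port below mirror the setup in the question; adjust paths to your install):

```shell
# Format of one passive service result for send_nsca:
#   <host_name><TAB><svc_description><TAB><return_code><TAB><plugin_output>
payload=$(printf 'Server1\tUptime\t0\tOK - uptime checked')
echo "$payload"
# To actually send it to the local nsca daemon on port 5667:
#   echo "$payload" | /usr/sbin/send_nsca -H 127.0.0.1 -p 5667 -c /etc/send_nsca.cfg
```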

Why does apache mod_perl process become a zombie?

Occasionally a mod_perl Apache process is marked "defunct" in the top utility; that is, it becomes a zombie process.
Is it a correct behavior?
Do I have to worry about it?
Our Perl script is very simple, it does not spawn any child processes.
The zombie process disappears pretty quickly.
Apache2, Ubuntu.
Our apache config is here: apache_config.txt
Here is a snapshot of top.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
19525 www-data 20 0 55972 25m 4684 S 10.3 2.4 0:00.32 apache2
19486 www-data 20 0 52792 21m 4120 S 1.7 2.1 0:00.05 apache2
19538 www-data 20 0 52792 21m 4120 S 1.3 2.1 0:00.04 apache2
19539 www-data 20 0 0 0 0 Z 0.7 0.0 0:00.03 apache2 <defunct>
19481 www-data 20 0 52860 21m 4016 S 0.3 2.1 0:00.05 apache2
19521 www-data 20 0 52804 21m 3824 S 0.3 2.1 0:00.08 apache2
These are the CPAN modules I use:
CGI();
XML::LibXML();
DateTime;
DateTime::TimeZone;
Benchmark();
Data::Dump();
Devel::StackTrace();
DBD::mysql();
DBI();
LWP();
LWP::UserAgent();
HTTP::Request();
HTTP::Response();
URI::Heuristic();
MD5();
IO::String();
DateTime::Format::HTTP();
Math::BigInt();
Digest::SHA1();
top:
26252 www-data 20 0 0 0 0 Z 0.3 0.0 0:00.22 apache2 <defunct>
access.log with pid logged as the first parameter:
26252 85.124.207.173 - - [26/Dec/2009:22:16:42 +0300] "GET /cgi-bin/wimo/server/index.pl?location=gn:2761369&request=forecast&client_part=app&ver=2_0b191&client=desktop&license_type=free&auto_id=125CC6B6DAA HTTP/1.1" 200 826 0 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; GTB6.3; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"
3 different zombie processes logged by server-status
Srv PID Acc M CPU SS Req ConnChild Slot Client VHost Request
32-0 1300 0/0/45 _ 0.00 0 0 0.0 0.00 2.29 127.0.0.1 weather_server OPTIONS * HTTP/1.0
100-0 1254 1/7/41 C 0.22 0 0 0.0 0.00 1.51 127.0.0.1 weather_server OPTIONS * HTTP/1.0
29-0 1299 0/12/78 _ 0.31 0 2 0.0 0.78 2.37 [my ip was here] weather_server GET /server-status HTTP/1.1
My first suspicion is that you really are forking but maybe you don't realize it. Is it possible for you to include your code? Remember that any system() or backtick (``) calls fork. This could easily be happening inside a CPAN module without you realizing it. There is some useful information about mod_perl and forking (including how zombies are created and how to avoid them) here.
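To see who is leaving the zombies, you can list defunct processes together with their parent PID and match the PPID against your Apache workers (plain ps/awk, nothing mod_perl-specific):

```shell
# Print the header plus any process whose state starts with "Z" (zombie),
# along with its parent PID so the forking worker can be identified.
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
```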
Update: try adding this to your config:
# Monitor apache server status
ExtendedStatus On
<VirtualHost 127.0.0.1:80>
<Location /server-status>
SetHandler server-status
Order deny,allow
Deny from all
Allow from 127.0.0.1
</Location>
</VirtualHost>
And then change the Allow from to be your IP, then you can visit http://yourdomain.com/server-status and get a page of summary information on apache. Try doing this when you see one of the zombies and look to see what apache thinks that process is doing.
I see that too with my very simple mod_perl 2 module. I do not fork anything; I just write a string to the client socket and then return OK. And still a defunct process appears on my CentOS 5.5 Linux VM and then goes away. Here is my source code; you can test it with "telnet yourhost 843" and pressing ENTER:
package SocketPolicy;
# Run: semanage port -a -t http_port_t -p tcp 843
# And add following lines to the httpd.conf
# Listen 843
# <VirtualHost _default_:843>
# PerlModule SocketPolicy
# PerlProcessConnectionHandler SocketPolicy
# </VirtualHost>
use strict;
use warnings FATAL => 'all';
use APR::Const(-compile => 'SO_NONBLOCK');
use APR::Socket();
use Apache2::ServerRec();
use Apache2::Connection();
use Apache2::Const(-compile => qw(OK DECLINED));
use constant POLICY =>
qq{<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy SYSTEM
"http://www.adobe.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
<allow-access-from domain="*" to-ports="8080"/>
</cross-domain-policy>
\0};
sub handler {
my $conn = shift;
my $socket = $conn->client_socket();
my $offset = 0;
# set the socket to the blocking mode
$socket->opt_set(APR::Const::SO_NONBLOCK => 0);
do {
my $nbytes = $socket->send(substr(POLICY, $offset), length(POLICY) - $offset);
# client connection closed or interrupted
return Apache2::Const::DECLINED unless $nbytes;
$offset += $nbytes;
} while ($offset < length(POLICY));
my $slog = $conn->base_server()->log();
$slog->warn('served socket policy to: ', $conn->remote_ip());
return Apache2::Const::OK;
}
1;
