Nagios with Munin services gives UNKNOWN

On an up-to-date Debian Wheezy server, I'm using the backports of the following packages:
Nagios: nagios3 (3.4.1-5~bpo7+1)
Munin: munin (2.0.25-1~bpo70+1)
and nsca (2.9.1-2) to transmit data from Munin to Nagios in order to process alerts.
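The Munin side is wired up roughly like this in munin.conf, a one-line contact that pipes alerts through send_nsca (the hostname here is a placeholder, not my actual config):
# in /etc/munin/munin.conf -- forward Munin alerts to the Nagios host via NSCA
contact.nagios.command /usr/bin/send_nsca -H nagios.example.com -c /etc/send_nsca.cfg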
Nagios is working fine with the following configured Munin services:
# generic service template definition
define service{
        name                        generic-munin-service ; The 'name' of this service template
        use                         generic-service
        check_command               return-unknown!"No Data from passive check"
        active_checks_enabled       0 ; Active service checks are disabled
        passive_checks_enabled      1
        parallelize_check           1
        notifications_enabled       1
        event_handler_enabled       1
        is_volatile                 1
        notification_interval       120
        notification_period         24x7
        notification_options        w,u,c,r
        check_freshness             1
        freshness_threshold         360
        flap_detection_options      n
        max_check_attempts          2
        register                    0 ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        ;first_notification_delay   6 ; Delay first notification for false positives (will execute 2 checks : munin sends 1 check every 5 minutes)
}
define service {
        hostgroup_name          munin
        service_description     Disk latency per device :: Average latency for /dev/sda
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
define service {
        hostgroup_name          munin
        service_description     Disk latency per device :: Average latency for /dev/sdb
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
define service {
        hostgroup_name          munin
        service_description     Disk usage in percent
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
define service {
        hostgroup_name          munin
        service_description     Inode usage in percent
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
define service {
        hostgroup_name          munin
        service_description     File table usage
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
But when I add further services, which are also available on all monitored hosts, they get labelled as UNKNOWN in Nagios:
define service {
        hostgroup_name          munin
        service_description     Memory usage
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
define service {
        hostgroup_name          munin
        service_description     CPU usage
        use                     generic-munin-service
        notification_interval   0 ; set > 0 if you want to be renotified
}
I've already found out that, depending on the Munin plugin's graph title format, Nagios may not understand the incoming data. That's why I updated the packages on the server to the Wheezy backports versions, since Munin 2.0.7 and later should clean up all titles.
I also tried to debug with a higher debug level, and the log shows:
[1434122043] SERVICE ALERT: HostIJZI4;Memory usage;UNKNOWN;HARD;2;INCONNU
("INCONNU" is simply the French localization of UNKNOWN.)
But I may need your help to go further.
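One thing I can do to rule out the Nagios side is inject a passive result by hand with send_nsca (input is tab-separated host, service, return code, output; the host and service names here are taken from the log line above, assuming nsca listens locally):
printf "HostIJZI4\tMemory usage\t0\tOK - manual test\n" | /usr/sbin/send_nsca -H 127.0.0.1 -c /etc/send_nsca.cfg
If that flips the service to OK, the NSCA pipeline itself is fine and the problem is in what Munin sends.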

I suggest you update your packages; Nagios Core is currently at 4.1.1 and you are using an older version.
They have fixed a lot of things, so maybe your issue is patched by now: https://www.nagios.org/projects/nagios-core/history/4x/

Related

Nagios: Return code of 7 is out of bounds

Services are up and running on the remote nodes. CLI execution returns OK, but in the UI it returns CRITICAL with Status Information: 'Return code of 7 is out of bounds'.
nagios-xxxxxxxx:~# /usr/lib/nagios/plugins/check_tcp -H hostname -p <port> -w 5 -c 10 -t 60
TCP OK - 0.002 second response time on hostname port XXXXXXX|time=0.001642s;5.000000;10.000000;0.000000;60.000000
Can someone help me fix it?
Nagios log:
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
I fixed these issues. They were actually caused by duplicated service configs on the Nagios server (location: /etc/nagios4/objects/services/).
I cleared the duplicate service configs from that location and reloaded the Nagios service.
Issues cleared.
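A quick way to spot such duplicates, assuming the same layout:
grep -rh service_description /etc/nagios4/objects/services/ | sort | uniq -d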
I reproduced this problem on my systems. I have 620 hosts and 7000 services.
When the number of services exceeds 6189, all plugins become unusable with "Return code of 7 out of bounds", even if they are just a /bin/true command.
The main solution is to set this in nagios.cfg:
enable_environment_macros=0
For a long time I did not want to do this, because one of my plugins uses Nagios ENV variables when building the HTML e-mail for notifications.
But I found a way to keep it running: you need to manually set the necessary ENV variables for that particular plugin, like this:
define command{
command_name notify-html-service
command_line NAGIOS_NOTIFICATIONTYPE='$NOTIFICATIONTYPE$' NAGIOS_SERVICEATTEMPT='$SERVICEATTEMPT$' NAGIOS_SERVICESTATE='$SERVICESTATE$' NAGIOS_CONTACTGROUPNAME='$CONTACTGROUPNAME$' NAGIOS_HOSTNAME='$HOSTNAME$' NAGIOS_SERVICEDESC='$SERVICEDESC$' NAGIOS_LONGSERVICEOUTPUT='$LONGSERVICEOUTPUT$' NAGIOS_HOSTADDRESS='$HOSTADDRESS$' NAGIOS_HOSTGROUPNAMES='$HOSTGROUPNAMES$' NAGIOS_HOSTALIAS='$HOSTALIAS$' NAGIOS_SERVICEOUTPUT='$SERVICEOUTPUT$' NAGIOS_LONGDATETIME='$LONGDATETIME$' NAGIOS_SERVICEDURATION='$SERVICEDURATION$' NAGIOS_NOTIFICATIONRECIPIENTS='$NOTIFICATIONRECIPIENTS$' NAGIOS_SERVICEGROUPALIAS='$SERVICEGROUPALIAS$' NAGIOS_HOSTALIAS='$HOSTALIAS$' NAGIOS_NOTIFICATIONAUTHOR='$NOTIFICATIONAUTHOR$' NAGIOS_NOTIFICATIONCOMMENT='$NOTIFICATIONCOMMENT$' NAGIOS_CONTACTEMAIL='$CONTACTEMAIL$' NAGIOS_SERVICEATTEMPT='$SERVICEATTEMPT$' /usr/bin/perl '$USER7$/send.notify' http://192.168.1.1/nagios 2>/tmp/send.log
}
define command{
command_name notify-html-host
command_line NAGIOS_NOTIFICATIONTYPE='$NOTIFICATIONTYPE$' NAGIOS_HOSTSTATE='$HOSTSTATE$' NAGIOS_CONTACTGROUPNAME='$CONTACTGROUPNAME$' NAGIOS_HOSTNAME='$HOSTNAME$' NAGIOS_HOSTADDRESS='$HOSTADDRESS$' NAGIOS_HOSTGROUPNAMES='$HOSTGROUPNAMES$' NAGIOS_HOSTALIAS='$HOSTALIAS$' NAGIOS_LONGDATETIME='$LONGDATETIME$' NAGIOS_NOTIFICATIONRECIPIENTS='$NOTIFICATIONRECIPIENTS$' NAGIOS_SERVICEGROUPALIAS='$SERVICEGROUPALIAS$' NAGIOS_LONGHOSTOUTPUT='$LONGHOSTOUTPUT$' NAGIOS_HOSTALIAS='$HOSTALIAS$' NAGIOS_HOSTOUTPUT='$HOSTOUTPUT$' NAGIOS_HOSTDURATION='$HOSTDURATION$' NAGIOS_NOTIFICATIONAUTHOR='$NOTIFICATIONAUTHOR$' NAGIOS_NOTIFICATIONCOMMENT='$NOTIFICATIONCOMMENT$' NAGIOS_CONTACTEMAIL='$CONTACTEMAIL$' NAGIOS_SERVICEATTEMPT='' /usr/bin/perl '$USER7$/send.notify' http://192.168.1.1/nagios 2>/tmp/send.log
}
This helped me. Initially there was a single command for both notification types, with the differing host/service ENV vars preset by Nagios:
define command{
        command_name    notify-html
        command_line    /usr/bin/perl $USER2$/send.notify http://192.168.1.1/nagios 2>/tmp/send.log
}
By the way, the Nagios documentation recommends against setting enable_environment_macros=1:
Enabling this is a very bad idea for anything but very small setups,
as it means plugins, notification scripts and eventhandlers may run
out of environment space. It will also cause a significant increase
in CPU- and memory usage and drastically reduce the number of checks
you can run.
PS: My answer was edited because the notify-html command needed to be split into notify-html-host and notify-html-service. I started to receive wrong host notifications due to errors with macro definitions (service macros are absent in host notification events), and tracing the Nagios debug log showed lots of 'WARNING: An error occurred processing macro' messages.
Good luck.
I had this exact same issue, but it seems it was due to the number of services tied to a single servicegroup. Once the servicegroup had more than nine services reporting, they would return:
[XXXXXXX] Warning: Return code of 7 for check of service 'XXXXXXX' on host was out of bounds.
I reorganized my services into a few separate servicegroups, and all the checks functioned normally again without any further adjustment.
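A servicegroup split is just separate definitions with the members divided between them; a sketch with placeholder host and service names:
define servicegroup{
        servicegroup_name   checks-a
        alias               Checks, first batch
        members             host1,Service One,host2,Service Two
}
define servicegroup{
        servicegroup_name   checks-b
        alias               Checks, second batch
        members             host3,Service Three,host4,Service Four
}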

Nagios max_check_attempts/retry interval ignored with host checks

I'm running Nagios 3.2.3 and I have a mysterious issue with host checks. Here's an example host definition.
define host {
        host_name                       HOST
        contacts                        CONTACTS_HERE
        alias                           ALIAS
        max_check_attempts              15
        check_interval                  5
        active_checks_enabled           1
        passive_checks_enabled          1
        check_period                    24x7
        obsess_over_host                0
        retry_interval                  1
        check_freshness                 0
        freshness_threshold             120
        retain_status_information       1
        retain_nonstatus_information    1
        low_flap_threshold              0
        high_flap_threshold             0
        flap_detection_enabled          0
        process_perf_data               1
        notification_interval           120
        notification_period             24x7
        notification_options            d,u,r
        check_command                   check-host-alive
        icon_image_alt                  Linux
        icon_image                      linux40.png
        statusmap_image                 linux40.gd2
}
As you can see, max_check_attempts is set to 15 and retry_interval is set to 1 minute. The check command looks like this:
define command {
        command_name    check-host-alive
        command_line    /usr/lib64/nagios/plugins/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 1
}
What happens, however, is this sequence of events (newest first):
Host Up[01-30-2017 21:41:56] HOST ALERT: HOST_NAME;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 0.17 ms
Host Down[01-30-2017 21:41:21] HOST ALERT: HOST_NAME;DOWN;HARD;1;PING CRITICAL - Packet loss = 100%
Host Down[01-30-2017 21:41:10] HOST ALERT: HOST_NAME;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
So, after the first check failure the host goes into a hard state, instead of the check being retried 15 times, 1 minute apart. I should add that this seems to happen when the host is not really down but extremely busy.
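One detail that may matter: with -p 1, a single dropped packet already counts as 100% loss, so a busy host fails the check instantly. A variant with more packets (sketched here with -p 5, not my current config) would at least be less trigger-happy, though it doesn't explain the missing retries:
define command {
        command_name    check-host-alive
        # 5 echo requests instead of 1, so one dropped packet is 20% loss rather than 100%
        command_line    /usr/lib64/nagios/plugins/check_ping -H $HOSTNAME$ -w 3000.0,80% -c 5000.0,100% -p 5
}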
Any ideas?
Thanks,
Sergei

Vagrant VMs can talk to each other, but I can't reach HTTP from the host

I have 5 VMs on 192.168.56.*:
.19 - Zookeeper
.20 - Solr1
.21 - Solr2
.22 - Solr3
.23 - Solr4
This is my Vagrantfile:
Vagrant.configure(2) do |config|
  # The most common configuration options are documented and commented below.
  # For a complete reference, please see the online documentation at
  # https://docs.vagrantup.com.

  # Every Vagrant development environment requires a box. You can search for
  # boxes at https://atlas.hashicorp.com/search.
  # config.vm.box = "base"

  (1..4).each do |x|
    config.vm.define "solr#{x}" do |solr|
      solr.vm.box = 'ubuntu/wily64'
      # .20 through .23 for solr1 through solr4 (computed from the loop index,
      # since a counter reset inside the loop would give every VM the same IP)
      solr.vm.network "private_network", ip: "192.168.56.#{19 + x}", bridge: "Intel(R) Centrino(R) Advanced-N 6205"
      solr.vm.provider "virtualbox" do |v|
        v.memory = 2048
        #v.cpus = 1
      end
    end
  end
end
I have Apache HTTP on port 80 and Solr on port 8983. I can do wget 192.168.56.20:8983 from the ZooKeeper VM and it downloads the main page. When I try to hit 192.168.56.20:8983 from the host OS, it just hangs. Firewall rules are in place that open up those ports, so I have no idea why .19 can access Solr but the host cannot.
Any ideas?
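A couple of quick checks from the host side can separate routing problems from firewall problems, assuming a Unix-like host (on Windows, the equivalents would be curl and route print):
curl -v http://192.168.56.20:8983/        # does the TCP connection even open?
ip route | grep 192.168.56                # is there a route via the VirtualBox host-only adapter?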

Problems regarding nagios.cfg with defined services and hosts (generated by Nagiosql)

NagiosQL-generated files cause problems during the preflight check, even though everything seems to be okay.
/etc/nagios/nagios.cfg
....
## Hosts
cfg_dir=/etc/nagiosql/hosts/
cfg_file=/etc/nagiosql/hosttemplates.cfg
cfg_file=/etc/nagiosql/hostgroups.cfg
cfg_file=/etc/nagiosql/hostextinfo.cfg
cfg_file=/etc/nagiosql/hostescalations.cfg
cfg_file=/etc/nagiosql/hostdependencies.cfg
## Services
cfg_dir=/etc/nagiosql/services/
cfg_file=/etc/nagiosql/servicetemplates.cfg
cfg_file=/etc/nagiosql/servicegroups.cfg
cfg_file=/etc/nagiosql/serviceextinfo.cfg
cfg_file=/etc/nagiosql/serviceescalations.cfg
cfg_file=/etc/nagiosql/servicedependencies.cfg
...
nagios -v /etc/nagios/nagios.cfg
....
Running pre-flight check on configuration data...
Checking services...
Error: There are no services defined!
Checked 0 services.
Checking hosts...
Error: There are no hosts defined!
Checked 0 hosts.
The content seems okay to me:
[root@xxx services]# cd /etc/nagiosql/services/
[root@xxx services]# ls -alh
total 20K
drwsr-sr-x 2 apache nagios 4.0K Aug 7 10:46 .
drwsr-sr-x 5 apache nagios 4.0K Aug 7 12:17 ..
-rw-r--r-- 1 apache nagios 2.3K Aug 7 10:46 localhost.cfg
-rw-r--r-- 1 apache nagios 2.2K Aug 7 10:46 www.google.com.cfg
-rw-r--r-- 1 apache nagios 1.1K Aug 7 10:46 www.yahoo.com.cfg
[root@xxx hosts]# ls -alh
total 16K
drwsr-sr-x 2 apache nagios 4.0K Aug 11 07:12 .
drwsr-sr-x 5 apache nagios 4.0K Aug 7 12:17 ..
-rw-r--r-- 1 apache nagios 800 Aug 11 07:12 GIT.cfg
-rw-r--r-- 1 apache nagios 948 Aug 11 07:12 psm01.cfg
Content also seems to be fine (generated by NagiosQL):
[root@xxx hosts]# vi GIT.cfg
###############################################################################
#
# Host configuration file
#
# Created by: Nagios QL Version 3.2.0
# Date: 2015-08-11 07:12:54
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios QL will overwite all manual settings during the next update
#
###############################################################################
define host {
        host_name    GIT
        alias        GIT Server
        address      172.25.10.80
        register     0
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
~
Can somebody tell me where the solution to this problem is? I've already wasted 2 hours...
Try removing the final slash from the directory names in your cfg_dir definitions and see if that doesn't get it to recognize the cfg files in that directory.
For example,
Change:
cfg_dir=/etc/nagiosql/hosts/
...
cfg_dir=/etc/nagiosql/services/
To:
cfg_dir=/etc/nagiosql/hosts
...
cfg_dir=/etc/nagiosql/services
EDIT:
Okay, I think directory permissions may be causing the cfg_dir evaluations to fail. According to the ls -alh output you listed, your /etc/nagiosql/hosts/, /etc/nagiosql/services/, and /etc/nagiosql/ directories do not grant write permissions to the nagios group. Nagios will need to get a directory listing for those directories and will need group write permissions to do it.
To remedy:
chmod g+w /etc/nagiosql/hosts/
chmod g+w /etc/nagiosql/services/
Restart nagios service.
Also, you don't need to remove the slashes from the directory paths in the nagios cfg_dir configurations. Nagios will strip the trailing slash (/) for you, according to the code:
https://github.com/NagiosEnterprises/nagioscore/blob/eb8e83d5d05e572eb8c0d4d4764885c5427b4b69/xdata/xodtemplate.c#L327
/* process all files in a config directory */
else if(!strcmp(var, "xodtemplate_config_dir") || !strcmp(var, "cfg_dir")) {

    if(config_base_dir != NULL && val[0] != '/') {
        asprintf(&cfgfile, "%s/%s", config_base_dir, val);
    }
    else
        cfgfile = strdup(val);

    /* strip trailing / if necessary */
    if(cfgfile != NULL && cfgfile[strlen(cfgfile) - 1] == '/')
        cfgfile[strlen(cfgfile) - 1] = '\x0';

    /* process the config directory... */
    result = xodtemplate_process_config_dir(cfgfile, options);
    my_free(cfgfile);

    /* if there was an error processing the config file, break out of loop */
    if(result == ERROR)
        break;
}
EDIT #2: In the host definition you posted, your register value is set to 0. Try setting it to 1 instead. register 0 is used for templates that will be inherited from, but will not actually show up in the Nagios UI.
Change:
define host {
        host_name    GIT
        alias        GIT Server
        address      172.25.10.80
        register     0
}
To:
define host {
        host_name    GIT
        alias        GIT Server
        address      172.25.10.80
        register     1
}
Also please set register 1 for your service definitions as well.
Try adding executable permissions to your directories. Some programs and languages require +x permissions in order to actually open the directory.
If that doesn't work, temporarily set everything to 0777 permissions to see if the issue is permissions related at all.
You also have config problems even if you get that part working. Your host and service configs don't have a use directive in them, which would point to a template holding most of the default values. The register directive is implied as 1 unless you specifically set it to 0 for a template; see the sketch below. Check the object definition docs if you need a reference: https://assets.nagios.com/downloads/nagioscore/docs/nagioscore/3/en/objectdefinitions.html
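Putting those points together, a registered host that inherits its defaults would look something like this, assuming the stock generic-host template from the sample configs is available:
define host {
        use          generic-host      ; template carrying the usual default values (assumed to exist)
        host_name    GIT
        alias        GIT Server
        address      172.25.10.80
}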

setting up passive checks on nagios

Hello board, this question may be a little green, but:
I've been trying to set up Nagios NSCA for passive checks on a local Ubuntu box as a prototype.
For those in the know: my nsca daemon is listening on 5667, and send_nsca runs on the same Ubuntu machine (localhost, 127.0.0.1). I've been reading up on and testing object definitions and service templates, but I keep getting config errors when I try to access the Nagios web UI after my modifications.
I hope to get clearer instructions on how to create the service (directories/configurations) to process passive checks in Nagios 3 on Ubuntu.
There are a few things to consider: firstly, that localhost is defined as a host; and secondly, that the check actually exists, as it would for any other check, but with a command that doesn't actually do anything. For example, I've created a passiveservices.cfg file with services defined as follows:
define service{
        use                      generic-service,service-pnp
        host_name                Server1,Server2
        service_description      Uptime
        active_checks_enabled    1
        passive_checks_enabled   1
        check_command            check_null
        check_freshness          1
        check_period             none
}
define service{
        use                      generic-service,service-pnp
        host_name                Server1,Server2
        service_description      Drive space
        active_checks_enabled    1
        passive_checks_enabled   1
        check_command            check_null
        check_freshness          1
        check_period             none
}
Note that the check command is check_null; it's not actually doing anything, and passive_checks_enabled is 1.
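check_null is not a stock Nagios command, so it needs its own definition; one way to define it, assuming the standard check_dummy plugin from the nagios-plugins package is installed:
define command{
        command_name    check_null
        command_line    $USER1$/check_dummy 0 "No passive result received yet"
}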
There are two lines within nagios.cfg which you need to enable:
accept_passive_host_checks
accept_passive_service_checks
It's also a good idea to enable the following two lines as well:
check_service_freshness
check_host_freshness
If a server doesn't check in after a set amount of time, it'll trigger a script (I trigger an email within my config).
Lastly, enable the following two lines:
log_external_commands
log_passive_checks
They'll help with debugging if this doesn't work. It writes out to /var/log/syslog on Ubuntu (well, it does on mine).
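Taken together, the nagios.cfg settings mentioned above look like this when enabled:
accept_passive_host_checks=1
accept_passive_service_checks=1
check_service_freshness=1
check_host_freshness=1
log_external_commands=1
log_passive_checks=1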
