Unable to fetch port count on Nagios

I have put together a command for fetching the established connection count using the Nagios check_by_ssh plugin.
I am able to get the output when I run the command by hand; however, after placing the command in commands.cfg I am seeing "check_by_ssh: skip-stderr argument must be an integer" in the GUI. Any suggestion on this would be of great help.
Command:
/usr/local/nagios/libexec/check_by_ssh -l fuseadmin -H <hostname> -C "netstat -punta | grep -i ESTABLISHED | wc -l | awk '{if (\$0>2500) {print \"CRITICAL: Established Socket Count: \"\$0} else {print \"OK: Established Socket Count: \"\$0}}'" -i ~/.ssh/id_dsa -E
OK: Established Socket Count: 67
Commands.cfg:
define command {
command_name netstat_cnt_estanblished_gt_2500_fuse01
command_line /usr/local/nagios/libexec/check_by_ssh -l fuseadmin -H a0110pcsgesb01 -C "netstat -punta | grep -i ESTABLISHED | wc -l 2>&1 | awk '{if (\$0>2500) {print \"CRITICAL: Established Socket Count: \"\$0} else {print \"OK: Established Socket Count: \"\$0}}'" -i ~/.ssh/id_dsa -E
}
Service Definition
#netstat_cnt_estanblished_gt_2500_csg2.0
define service{
use generic-service ; Name of service template to use
host_name <hostname>
service_description Netstat Established Count
event_handler send-service-trap-fms
event_handler_enabled 1
check_command netstat_cnt_estanblished_gt_2500_fuse01
max_check_attempts 1
notifications_enabled 1 ; Service notifications are enabled
check_period 24x7 ; The service can be checked at any time of the day
max_check_attempts 3 ; Re-check the service up to 3 times in order to determine its final (hard) state
check_interval 2 ; Check the service every 10 minutes under normal conditions
retry_interval 2 ; Re-check the service every two minutes until a hard state can be determined
contact_groups fuse_users ; Notifications get sent out to everyone in the 'admins' group
notification_options w,u,c,r ; Send notifications about warning, unknown, critical, and recovery events
notification_interval 30 ; Re-notify about service problems every hour
notification_period 24x7
}
** I have changed the actual hostname due to compliance.

here it says:
check_by_ssh: print command output in verbose mode
Right now it is not possible to print the command output of ssh; check_by_ssh
only prints the command itself. This patch adds printing the output too. This
makes it possible to use ssh with verbose logging, which helps debugging any
connection, key or other ssh problems.
Note: you must use -E,--skip-stderr=<high number>, otherwise check_by_ssh would
always exit with unknown state.
Example:
./check_by_ssh -H localhost -o LogLevel=DEBUG3 -C "sleep 1" -E 999 -v
Meaning: you should just have to specify a number after -E, like -E 999, in your definition (as the example in the code block above shows).
...even though it's confusing (maybe a bug?), because the command help of check_by_ssh says:
-E, --skip-stderr[=n]
Ignore all or (if specified) first n lines on STDERR [optional]
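Applied to the command definition from the question, the fix would look something like this, giving -E a numeric argument (999 is taken from the example above; the exact value is only illustrative):
define command {
command_name netstat_cnt_estanblished_gt_2500_fuse01
command_line /usr/local/nagios/libexec/check_by_ssh -l fuseadmin -H a0110pcsgesb01 -C "netstat -punta | grep -i ESTABLISHED | wc -l 2>&1 | awk '{if (\$0>2500) {print \"CRITICAL: Established Socket Count: \"\$0} else {print \"OK: Established Socket Count: \"\$0}}'" -i ~/.ssh/id_dsa -E 999
}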

Related

In a Bash script, trying to pass local variables to SSH and then execute other commands

#!/bin/bash
count2=1
declare -a input
input=( "$#" )
echo " "
echo " Hostname passed by user is " ${input[0]}
HOST="${input[0]}"
sshpass -p '<pass>' ssh -o StrictHostKeyChecking=no user@$HOST /bin/bash << ENDSSH
echo " Connected "
echo $count2
echo $input
pwd
echo $count2: ${input[$count2]}
nic=${input[$count2]}
echo $nic
echo $(ethtool "${nic}" |& grep 'Link' | awk '{print $3}')
ENDSSH
So I actually want to pass the variables 'count2' and 'input' to the remote host over SSH and execute the commands there.
Unfortunately they are not getting passed; nothing is echoed after the SSH connection.
Need help with this!
I have sshpass installed on the server.
code output:
[user@l07 ~]$ ./check.sh <hostname> eno6
Hostname passed by user is <hostname>
Connected
After SSH it only echoes "Connected". I'm not sure why $count2 and $input are not echoing.
I tried with a backslash ('\$count2') but that is not working either. I have tried all possible combinations, even quoting and unquoting ENDSSH. Please help.
Any help will be really appreciated!
You basically want to supply to your remote bash a HERE-document to be executed. This is tricky, since you need to "compose" the full text of this document before you can supply it to ssh. I would therefore separate the task into two parts:
Creating the HERE-document
Running it on ssh
This makes debugging easy: you can print the document between steps 1 and 2 and visually inspect its contents for correctness. Don't forget that once this code runs on the remote host, it can't access any of your variables anymore, unless you have "promoted" them to the remote side using the means provided by ssh.
Hence you could start like this:
# Create the parameters you want to use
nic=${input[$count2]}
# Create a variable holding the content of the remote script,
# which interpolates your parameters
read -r -d '' remote_script << ENDSSH
echo "Connected to host \$(hostname)"
echo "Running bash version: \$BASH_VERSION"
....
ethtool "$nic" |& grep Link | awk '{ print $3 }'
ENDSSH
# Print your script for verification
echo "$remote_script"
# Submit it to the host
sshpass -p '<pass>' ssh -o StrictHostKeyChecking=no "user@$HOST" /bin/bash <<<"$remote_script"
You have to add escapes (\) here:
...
echo \$nic
...
echo \$(ethtool "\${nic}" |& grep 'Link' | awk '{print \$3}')
...
But why echo this at all? Try it without echo:
...
ethtool "\${nic}" |& grep -i 'Link' | awk '{print \$3}'
...
#!/bin/bash
count2=1
declare -a input
input=( "$#" )
echo " Hostname passed by user is " "${input[0]}"
HOST="${input[0]}"
while [ $# -gt $count2 ]
do
sed -i 's/VALUE/'"${input[$count2]}"'/g' ./check.sh
sshpass -p '<pass>' scp ./check.sh user@"$HOST":/home/user/check.sh
sshpass -p '<pass>' ssh -o StrictHostKeyChecking=no user@"$HOST" "sh /home/user/check.sh && rm -rf /home/user/check.sh"
sed -i 's/'"${input[$count2]}"'/VALUE/g' ./check.sh
((count2++))
done
I found another solution to this issue, and it is working for me now!
I wrote the entire logic that needs to be executed remotely in a check.sh file. The script substitutes the user input into check.sh, copies the file to the remote server via scp, executes it remotely, removes it from the remote server after successful execution, and then restores the original placeholder value in the local copy using sed.
This makes it a dynamic script that works for multiple servers.

How to output the hostname in the "service description" for Nagios Core?

I currently have the following two services defined:
define service {
use my-webapp-service
hostgroup_name all
service_description System check - PING
check_command check_ping!100.0,20%!500.0,60%
}
define service {
use my-webapp-service
hostgroup_name all
service_description System check - Swap Usage
check_command check_nrpe!check_swap
check_interval 1
}
What I want is for the output string to be:
System check - PING - "Actual hostname where this alarm got fired off"
System check - Swap Usage - "Actual hostname where this alarm got fired off"
I think this should be possible, but I just don't know how to do it.
I would sincerely appreciate your guidance on this.
Many Thanks
The output is produced by the plugin scripts. The default behavior is that the scripts don't return the hostname, because it is not necessary.
If you want to add the hostname to the output, you must edit the existing scripts or create a new one.
Here is basic info on how to create a plugin script for Nagios: http://www.yourownlinux.com/2014/06/how-to-create-nagios-plugin-using-bash-script.html
For your needs you must add $HOSTNAME to the echo. For instance:
echo "$HOSTNAME - WARNING - $output"
If you want the script that is executing to be aware of the hostname, you'll need to pass the hostname as an argument to the Nagios command. That also means that the script will need to accept the hostname as an argument. Take for example:
define service {
use my-webapp-service
hostgroup_name all
service_description System check - PING
check_command check_ping!100.0,20%!500.0,60%
}
check_ping probably looks something like:
define command {
command_name check_ping
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}
The problem here is that the executable at $USER1$/check_ping doesn't know that you want to pass the host's name as an argument. So you'll need to make a wrapper script. I'm not going to write the script for you, but to give you a hint, the command definition would look something like:
define command {
command_name check_ping_print_hostname
command_line $USER1$/my_check_ping_wrapper.sh --hostname $HOSTNAME$ -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}
And then the script at $USER1$/my_check_ping_wrapper.sh is obviously going to need to grab that --hostname argument, pass the other arguments directly to check_ping, wait for the output, and then amend the output with the information given in the --hostname arg.
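For illustration only, a minimal sketch of what such a wrapper could look like; the plugin path and the --hostname option handling here are assumptions, not a finished implementation:
#!/bin/bash
# my_check_ping_wrapper.sh (hypothetical) - append the Nagios host name to check_ping output.
# Expected usage: my_check_ping_wrapper.sh --hostname <name> <arguments passed on to check_ping>
hostname=""
if [ "$1" = "--hostname" ]; then
    hostname="$2"
    shift 2
fi
# Run the real plugin with whatever arguments remain (path is an assumption;
# use whatever $USER1$ points at in your resource.cfg)
output="$(/usr/lib64/nagios/plugins/check_ping "$@")"
status=$?
# Re-emit the plugin output with the host name appended, preserving the exit
# code so Nagios still derives the correct state
echo "$output - $hostname"
exit $status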
Hope this helps!

Nagios: config ping times

I am using Nagios ver. 4.0.8.
I want to set the interval between pings to 10 seconds, like below:
define command{
command_name check-host-alive
command_line $USER1$/check_ping -t 10 -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
But the actual interval is not 10 seconds (it is about 90 seconds). Can you help me?
Thanks
You are looking at things the wrong way.
define command{
command_name check-host-alive
command_line $USER1$/check_ping -t 10 -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
What you posted is a Nagios command definition. The '-t 10' is not the interval but the timeout argument: if check_ping does not get a result within 10 seconds, the command will time out.
To define the check interval, you need to look at the host (or service) configuration file.
For example:
define host {
host_name bogus-router
alias Bogus Router #1
address 192.168.1.254
parents server-backbone
check_command check-host-alive
check_interval 5
retry_interval 1
max_check_attempts 5
check_period 24x7
process_perf_data 0
retain_nonstatus_information 0
contact_groups router-admins
notification_interval 30
notification_period 24x7
notification_options d,u,r
}
The interval between checks in this example is 5 minutes (check_interval). It is not possible to set intervals of less than one minute with Nagios. If you want more granular (free) monitoring, check out InfluxDB, Telegraf and Grafana.

Nagios Monitoring Hosts with check_ping

I've deployed a new instance of Nagios on a fresh install of CentOS 7 via the EPEL repository. So the Nagios Core version is 3.5.1.
After installing nagios and nagios-plugins-all (via yum), I've created a number of hosts and service definitions, have tested my configuration with nagios -v /etc/nagios/nagios.cfg, and have Nagios up and running!
Unfortunately, my host checks are failing (although my service checks are working perfectly fine).
Within the Nagios Web GUI / Dashboard, if I drill down into a Host page with the "Host State Information", I see this being reported for "Status Information" (IP address removed):
Status Information: /usr/bin/ping -n -U -w 30 -c 5 {my-host-ip-address}
CRITICAL - Could not interpret output from ping command
So in my troubleshooting, I drilled down into the Nagios Plugins directory (/usr/lib64/nagios/plugins), and ran a test with the check_ping plugin consistent with the way check-host-alive runs the command (see below for my check-host-alive command definition):
./check_ping -H {my-ip-address} -w 3000.0,80% -c 5000.0,100% -p 5
This check_ping command returns the following output:
PING OK - Packet loss = 0%, RTA = 0.63
ms|rta=0.627000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
I haven't changed the definition of how check_ping works, and can confirm that I'm getting a "PING OK" whenever the command is run the same way that check-host-alive runs the command, so I cannot figure out what's going on!
Below are the command definitions for check-host-alive as well as check_ping.
# 'check-host-alive' command definition
define command{
command_name check-host-alive
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
}
{snip}
# 'check_ping' command definition
define command{
command_name check_ping
command_line $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
}
Any suggestions on how I can fix my check-host-alive command definition so that it evaluates the output of check_ping properly?
Edit
Below is the full define host {} template I'm using:
define host {
host_name myers ; The name of this host template
alias Myers
address [redacted]
check_command check-host-alive
contact_groups admins
notifications_enabled 0 ; Host notifications are enabled
event_handler_enabled 1 ; Host event handler is enabled
flap_detection_enabled 1 ; Flap detection is enabled
failure_prediction_enabled 1 ; Failure prediction is enabled
process_perf_data 1 ; Process performance data
retain_status_information 1 ; Retain status information across program restarts
retain_nonstatus_information 1 ; Retain non-status information across program restarts
notification_period 24x7 ; Send host notifications at any time
register 1
max_check_attempts 2
}
For anyone else who runs into this issue, there's another option besides changing the permissions on ping: simply change the host check command to use check_host rather than check_ping. While there are certainly some differences in functionality, the overall end result is the same.
There are those who will say this isn't a good option because of the ability to range the check_ping command, but it should be remembered that host checks aren't even executed until all service checks for a given host have failed. Anyway, if you're interested in testing throughput, there are MUCH better ways of going about it than relying on ICMP, which is the lowest priority traffic type on a network.
I'm sure the OP is well on to other things by now, but hopefully someone else who has this issue will benefit.
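If you go that route, the host check command could be redefined along these lines (this assumes a check_host binary is present in your plugins directory, as in recent nagios-plugins builds where it ships alongside check_icmp):
define command {
command_name check-host-alive
command_line $USER1$/check_host -H $HOSTADDRESS$
}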
I could not find ping at /usr/bin/ping; on my system it is /bin/ping, so:
# chmod u+s /bin/ping
# ls -al /bin/ping
-rwsr-xr-x 1 root root 40760 Sep 26 2013 /bin/ping*
Finally, run the command below:
/usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 100.0,20% -c 500.0,60% -p 5
I was fairly certain that running chmod u+s /usr/bin/ping would solve the issue, but I was (and still am) wary about chmod'ing system files. It seems to me that there has to be a safer way to do it.
However, in the end, that's what I did, and it works. I don't like it from a security standpoint.
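For what it's worth, one commonly mentioned safer alternative to the setuid bit is to grant ping only the raw-socket capability; this is an aside, assuming a distribution and filesystem with file-capability support:
# grant only CAP_NET_RAW instead of full setuid root
setcap cap_net_raw+ep /usr/bin/ping
# verify the capability was applied
getcap /usr/bin/ping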
I also had the same problem and the above answers did not work for me. After checking the issue further, I noticed that the reason was the IP protocol. Once I passed the correct IP protocol, it worked fine.
/usr/local/nagios/libexec/check_ping -H localhost -w 3000.0,80% -c 5000.0,100% -4
output
PING OK - Packet loss = 0%, RTA = 0.05 ms|rta=0.051000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
By default it was using IPv6.
/usr/local/nagios/libexec/check_ping -H localhost -w 3000.0,80% -c 5000.0,100% -6
output
/sbin/ping6 -n -U -W 30 -c 5 localhost
CRITICAL - Could not interpret output from ping command
But when integrating with the Nagios server, I was not able to pass this value as an argument. Therefore I applied the workaround below in the client-side nrpe.cfg file:
command[check_ping_args]=/usr/local/nagios/libexec/check_ping -H $ARG1$ -w $ARG2$ -c $ARG3$ -4
The host, warning, and critical thresholds are passed in from the Nagios host as below:
define service{
use generic-service
hostgroup_name all-servers
service_description Host Ping Status
check_command check_nrpe_args!check_ping_args!localhost!3000.0,80%!5000.0,100%
}
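For completeness, the matching command definition on the Nagios server side could look roughly like this (the argument layout is an assumption inferred from the service definition above):
define command {
command_name check_nrpe_args
command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -c $ARG1$ -a $ARG2$ $ARG3$ $ARG4$
}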

Find original owning process of a Linux socket

In Linux and other UNIX-like operating systems, it is possible for two (or more) processes to share an Internet socket. Assuming there is no parent-child relationship between the processes, is there any way to tell what process originally created a socket?
Clarification: I need to determine this from "outside" the processes using the /proc filesystem or similar. I can't modify the code of the processes. I can already tell what processes are sharing sockets by reading /proc/<pid>/fd, but that doesn't tell me what process originally created them.
You can use netstat for this. You should look in the columns 'Local Address' and 'PID/Program name'.
xxx#xxx:~$ netstat -tulpen
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode PID/Program name
tcp 0 0 127.0.0.1:4005 0.0.0.0:* LISTEN 1000 68449 7559/sbcl
tcp 0 0 0.0.0.0:6000 0.0.0.0:* LISTEN 0 3938 -
tcp6 0 0 :::6000 :::* LISTEN 0 3937 -
udp 0 0 0.0.0.0:68 0.0.0.0:* 0 4528 -
doesn't 'lsof -Ua' help?
You can likely find the shared sockets by parsing /proc/net/tcp (and similar "files" for other protocols). There's some docs on /proc/net/tcp here.
You would need to find the socket (perhaps by its IP addresses/port numbers?) and parse out the inode number. Once you have the inode, you can search through all of /proc/*/fd/*, calling stat on every link and inspecting the st_ino member of struct stat until you find a match.
The inode number should match between the two processes, so when you've gone through all of /proc/*/fd/* you should have found them both.
If what you do know is the process id and socket fd of the first process, you might not need to go through /proc/net/tcp at all: just stat /proc/<pid>/fd/<fd> and search the rest of /proc/*/fd/* for a matching inode. You would still need /proc/net/tcp if you want to fetch the IP addresses/port numbers, which you can find if you know the inode number.
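A rough sketch of that search in shell, assuming you already know the socket's inode number (without root you will normally only be able to inspect your own processes):
# find every process holding a file descriptor that refers to a given socket inode
inode=68449   # example value; take it from /proc/net/tcp, netstat -e or ss
for fd in /proc/[0-9]*/fd/*; do
    if [ "$(stat -L -c %i "$fd" 2>/dev/null)" = "$inode" ]; then
        pid=${fd#/proc/}; pid=${pid%%/*}
        echo "inode $inode is held by PID $pid ($fd)"
    fi
done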
For the purposes of creating a test case, consider a situation where multiple ssh-agent processes are running and have open sockets, i.e. a user runs ssh-agent multiple times and loses the socket/PID information printed when each agent started:
$ find /tmp -path "*ssh*agent*" 2>/dev/null
/tmp/ssh-0XemJ4YlRtVI/agent.14405
/tmp/ssh-W1Tl4i8HiftZ/agent.21283
/tmp/ssh-w4fyViMab8wr/agent.10966
Later, the user wants to programmatically determine the PID owner of a particular ssh-agent socket (i.e. /tmp/ssh-W1Tl4i8HiftZ/agent.21283):
$ stat /tmp/ssh-W1Tl4i8HiftZ/agent.21283
File: '/tmp/ssh-W1Tl4i8HiftZ/agent.21283'
Size: 0 Blocks: 0 IO Block: 4096 socket
Device: 805h/2053d Inode: 113 Links: 1
Access: (0600/srw-------) Uid: ( 4000/ myname) Gid: ( 4500/ mygrp)
Access: 2018-03-07 21:23:08.373138728 -0600
Modify: 2018-03-07 20:49:43.638291884 -0600
Change: 2018-03-07 20:49:43.638291884 -0600
Birth: -
In this case, because ssh-agent named its socket nicely, a human onlooker can guess that the socket belongs to PID 21284: the socket name contains a numeric component that is off by one from a PID identified with ps:
$ ps -ef | grep ssh-agent
myname 10967 1 0 16:54 ? 00:00:00 ssh-agent
myname 14406 1 0 20:35 ? 00:00:00 ssh-agent
myname 21284 1 0 20:49 ? 00:00:00 ssh-agent
It seems highly unwise to assume that the PIDs will reliably be off by exactly one, and one might also suppose that not all socket creators will name their sockets so helpfully.
@Cypher's answer points to a straightforward solution to the problem of identifying the PID of the socket owner, but is incomplete, as lsof can actually only identify this PID with elevated permissions. Without elevated permissions, no results are forthcoming:
$ lsof /tmp/ssh-W1Tl4i8HiftZ/agent.21283
$
With elevated permissions, however, the PID is identified:
$ sudo lsof /tmp/ssh-W1Tl4i8HiftZ/agent.21283
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
ssh-agent 21284 myname 3u unix 0xffff971aba04cc00 0t0 1785049 /tmp/ssh-W1Tl4i8HiftZ/agent.21283 type=STREAM
In this case, the owner of the PID (myname) and socket was the one doing the query, so it seemed elevated permissions should not be needed. Furthermore, the task performing the query was not supposed to be able to elevate permissions, so I looked for another answer.
This led me to @whoplisp's answer proposing netstat -tulpen as a solution to the OP's problem. While it may have been effective for the OP, the command line is too restrictive to serve as a general-purpose command and was completely ineffective in this case (even with elevated permissions).
$ sudo netstat -tulpen | grep -E -- '(agent.21283|ssh-agent)'
$
netstat, however, can come close if a different command-line is used:
$ netstat -ap | grep -E -- '(agent.21283)'
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
unix 2 [ ACC ] STREAM LISTENING 1785049 - /tmp/ssh-W1Tl4i8HiftZ/agent.21283
Sadly, here too, the PID is elusive without elevated permissions:
$ sudo netstat -ap | grep -E -- '(agent.21283|ssh-agent)'
unix 2 [ ACC ] STREAM LISTENING 1765316 10967/ssh-agent /tmp/ssh-w4fyViMab8wr/agent.10966
unix 2 [ ACC ] STREAM LISTENING 1777450 14406/ssh-agent /tmp/ssh-0XemJ4YlRtVI/agent.14405
unix 2 [ ACC ] STREAM LISTENING 1785049 21284/ssh-agent /tmp/ssh-W1Tl4i8HiftZ/agent.21283
Of the two solutions, however, lsof clearly wins the race:
$ time sudo netstat -ap | grep -E -- '(agent.21283|ssh-agent)' >/dev/null
real 0m5.159s
user 0m0.010s
sys 0m0.019s
$ time sudo lsof /tmp/ssh-W1Tl4i8HiftZ/agent.21283 >/dev/null
real 0m0.120s
user 0m0.038s
sys 0m0.066s
Yet another tool exists according to the netstat man page:
$ man netstat | grep -iC1 replace
NOTES
This program is mostly obsolete. Replacement for netstat is ss. Replacement for netstat -r is ip route. Replacement for netstat -i
is ip -s link. Replacement for netstat -g is ip maddr.
Sadly, ss also requires elevated permissions to identify the PID, but it beats both netstat and lsof on execution time:
$ time sudo ss -ap | grep -E "(agent.21283|ssh-agent)"
u_str LISTEN 0 128 /tmp/ssh-w4fyViMab8wr/agent.10966 1765316 * 0 users:(("ssh-agent",pid=10967,fd=3))
u_str LISTEN 0 128 /tmp/ssh-0XemJ4YlRtVI/agent.14405 1777450 * 0 users:(("ssh-agent",pid=14406,fd=3))
u_str LISTEN 0 128 /tmp/ssh-W1Tl4i8HiftZ/agent.21283 1785049 * 0 users:(("ssh-agent",pid=21284,fd=3))
real 0m0.043s
user 0m0.018s
sys 0m0.021s
In conclusion, it appears that for this kind of PID identification, elevated permissions are required.
Note: Not all operating systems require elevated permissions. For example, SCO Openserver 5.0.7's lsof seemed to work just fine without elevating permissions.
Caveat: This answer may fail with respect to the OP's qualification for finding "the original creator" of the socket. In the example used, no doubt PID 21283 was the originator of the socket's creation as this PID is identified in the socket name. Neither lsof nor netstat identified PID 21283 as the original creator, though clearly PID 21284 is the current maintainer.

Resources