Nagios conditional checks

Currently I am monitoring my target Windows hosts for a bunch of services (CPU, memory, disks, SSL certs, HTTP, etc.). I'm using nsclient as the agent that the Nagios server talks to.
My problem is that I deploy to those hosts three times every 24 hours. The deployment process requires the hosts to reboot. Whenever my hosts reboot I get Nagios alerts for each service. This means a large volume of alerts, which makes it difficult to identify real issues.
Ideally I'd like to do the following:
If the host is down, don't send any alerts for the rest of the services
If the host is rebooting, this means that nsclient is not accessible. I want to receive only one alert (e.g. "CPU is not accessible") and mute everything else for a few minutes, so the host can finish booting and nsclient becomes available.
Implementing this would mean I get one email per host for each deployment. This is much better than everything turning red and me getting flooded with alerts that aren't worth checking (since they're only sent because the Nagios client, nsclient, is not available during the reboot).
Got to love using a Windows stack...

There are several ways to handle this.
If your deploys happen at the same time every day:
1. you could modify your active time period to exclude those times (a sketch follows this list), or
2. schedule downtime for your host via the Nagios GUI
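For example, a sketch of the timeperiod approach, assuming Nagios 3 or later (where timeperiods support exclude); the timeperiod names and deploy windows below are placeholders, not taken from the question:

define timeperiod{
        timeperiod_name deploy-windows
        alias           Daily deployment windows
        monday          02:00-02:30,10:00-10:30,18:00-18:30
        ; ...repeat the same ranges for the other days of the week
        }

define timeperiod{
        timeperiod_name 24x7-minus-deploys
        alias           24x7 except deployment windows
        monday          00:00-24:00
        ; ...repeat for the other days of the week
        exclude         deploy-windows
        }

You would then reference 24x7-minus-deploys as the notification_period on the affected hosts and services.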
If your deployments happen at different/random times, things become a bit harder to work around:
1. when NRPE or nsclient is not reachable, Nagios will often throw an 'UNKNOWN' alert for the check. If you remove the 'u' option from the following entries:
host_notification_options [d,u,r,f,s,n]
service_notification_options [w,u,c,r,f,s,n]
That would prevent the UNKNOWNs from sending notifications, or
2. dynamically disable active checking of the impacted checks by 'turning them off' before you start the deployment and 'turning them back on' after it finishes. This can be automated using the Nagios 'external commands file' (see the sketch below).
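As a rough sketch of option 2 (the host name is a placeholder, and the command-file path depends on the command_file setting in nagios.cfg), the deployment script could write Nagios external commands before and after the reboot:

# before the deployment/reboot
now=$(date +%s)
printf "[%s] DISABLE_HOST_SVC_CHECKS;winhost01\n" "$now" > /usr/local/nagios/var/rw/nagios.cmd
printf "[%s] DISABLE_HOST_SVC_NOTIFICATIONS;winhost01\n" "$now" > /usr/local/nagios/var/rw/nagios.cmd

# after the host is back up and nsclient is reachable again
now=$(date +%s)
printf "[%s] ENABLE_HOST_SVC_NOTIFICATIONS;winhost01\n" "$now" > /usr/local/nagios/var/rw/nagios.cmd
printf "[%s] ENABLE_HOST_SVC_CHECKS;winhost01\n" "$now" > /usr/local/nagios/var/rw/nagios.cmd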

Jim Black's answer would work, or if you want to go even more in depth you can define dependencies with service notification escalation, as described in the documentation below.
Escalating the alerts would mean you could define: CPU/SSL/etc. check fails -> check whether the host is down -> notify / don't notify.
Nagios Service Escalation (3.0)
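As an illustration of the dependency route (host and service names are placeholders; the 'master' service here is assumed to be a check that verifies nsclient is reachable), a service dependency that suppresses notifications for a dependent check while the master check is critical/unknown would look roughly like:

define servicedependency{
        host_name                       winhost01
        service_description             NSClient Reachable
        dependent_host_name             winhost01
        dependent_service_description   CPU Load
        execution_failure_criteria      c,u
        notification_failure_criteria   c,u
        }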

Related

Nagios Core Summary Macro has Wrong Count

I believe the following summary macros are not accounting for Passive Services:
$TOTALSERVICESCRITICALUNHANDLED$ (This is the one with which I see the problem directly)
And I assume the following two have the same issue:
$TOTALSERVICESWARNINGUNHANDLED$
$TOTALSERVICESUNKNOWNUNHANDLED$
Passive services that are NOT in downtime and NOT acknowledged rightfully show up on the Unhandled Services page of Nagios Core.
But a script I'm using prints the value of $TOTALSERVICESCRITICALUNHANDLED$, and that value does not account for passive services that are not in downtime, not acknowledged, and in a critical state.
The wording for this macro indicates that the service must have 'checks enabled', but this probably does not account for passive checks:
"
This macro reflects the total number of services that are currently in a CRITICAL state that are not currently being "handled". Unhandled services problems are those that are not acknowledged, are not currently in scheduled downtime, and for which checks are currently enabled.
"
My setup:
I have a command that is executed by a regularly scheduled service. The command passes the value of macro $TOTALSERVICESCRITICALUNHANDLED$ to a script.
The script just echoes the value of that macro.
Test:
All services are in downtime except my passive service, which has passive checks enabled and is in a critical state. The script reports "0" for the number of unhandled critical alerts (this is incorrect!).
If I enable active checks on the passive service, the script then reports "1".
Nagios Core Version 4.3.2
Please advise whether this is a bug that was addressed in a later version, or whether there is any workaround for me.
I have seen this related issue which was fixed in 4.2.2 but is a different issue: viewtopic.php?t=39957
I ended up making this change to the source code. If a service is in downtime or is acknowledged, it is already excluded from the unhandled alarm count, so the check for checks_enabled is redundant and incorrectly throws out passive services (checks_enabled appears to be a flag that only reflects ACTIVE checks).
common/macros.c, starting at line 1216:
Comment out the 3 instances (2 lines each) where it checks
if(temp_service->checks_enabled == FALSE) problem = FALSE;
(and then rebuild Nagios Core)
The only way I can see this coming back to bite me is if there's a case where active services had their active checks disabled and also were not in downtime or acknowledged.
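For reference, the edit described above amounts to commenting out tests of this shape (paraphrased, not the exact upstream source; line numbers vary by version):

/* if(temp_service->checks_enabled == FALSE)
        problem = FALSE; */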

Connection tab of Google Cloud SQL instance taking forever to load the console interface

I want to access my Cloud SQL database from my computer, but the Connection tab never finishes loading so that I can enter my IPv6 address. This is the second time I'm experiencing this issue, and my network connection is fine. It's now been 20 minutes, and the three dots are still indicating progress that never ends.
The first time it happened I had to leave my computer and go for a walk. This is really frustrating since the instance is in production and rapid updates should not be delayed.
How can I fix this?
POSSIBLE CAUSE:
It happens after I re-open MySQL Workbench and it fails to connect because my IPv6 address has changed, possibly by my Internet Service Provider (ISP) (I don't know of other possible reasons). After MySQL Workbench fails, I go to the console to enter the new address, but then this problem occurs.
I think some Cloud SQL security mechanism (I don't know its exact name) is treating this as a malicious access attempt and delaying immediate subsequent access. If so, that is impractical, since my computer doesn't tell me when my IPv6 address has changed, and regular IPv6 address changes shouldn't be treated as malicious, or developers will keep suffering from this issue.
EDIT: This time it finished loading after approximately 50 minutes.
Have you considered using the Cloud SQL proxy to connect to your instance instead of white-listing an IP? White-listing an IP can be insecure since it provides anyone on your network access, and inconvenient (as you have discovered) because if your IP changes you lose access.
The proxy uses a service account to provide authenticated access to your instance, so it will work regardless of your IP (as long as your service account has the correct permissions). Check out these instructions for a guide on starting it up.
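For illustration, starting the proxy locally looks roughly like this (the instance connection name and port are placeholders; see the linked instructions for the exact setup and authentication steps):

./cloud_sql_proxy -instances=my-project:us-central1:my-instance=tcp:3306

MySQL Workbench can then connect to 127.0.0.1:3306 instead of the instance's public IP.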
(As a side note, it's difficult to tell why your connection tab is failing to load. It might be a browser add-on or even a networking problem in your local network that is interfering. You can check the browser dev console to see if any errors appear.)

How do I get my Domain Controllers to sync with a correct external time source?

I had a user contact me saying that her computer clock is 8 or 9 minutes faster than her cell phone clock. That concerned me because cell phone clocks are always synced. I looked at my computer's clock, and it was the same, about 8 minutes ahead of my phone. Eight minutes is a lot of time to be off. So I looked at my two DCs. The one that serves as the AD PDC Emulator is only 1 minute faster than my phone; that seems more reasonable. But workstations aren't syncing with it. So I looked at my other DC, which has none of the master roles. It is exactly the same as the workstations, about 8 minutes fast.
So there are a couple of big problems here. First, my DCs don't have the same time. Second, my workstations have the same time as the faster DC (are they syncing to it?). I looked in the error logs of both DCs and filtered for the Time-Service. The PDC Emulator DC has Warning Event ID 144: The time service has stopped advertising as a good time source. The other DC has Warning Event ID 142: The time service has stopped advertising as a time source because the local clock is not synchronized. I am getting other Event ID warnings as well. On the primary DC: Event IDs 12, 36, 144 (mentioned above), 131. On the secondary DC: Event IDs 131, 24, 142 (mentioned above), 50, 129. I will give more info on these at the bottom.
From what I'm seeing, it looks like my PDCe is not pointing to an external source. Should I use the instructions here (http://support.microsoft.com/kb/816042) under "Configuring the time service to use an external time source" to set it up? The guy in the article (http://tigermatt.wordpress.com/2009/08/01/windows-time-for-active-directory/) says to use a script to automate it (w32tm /config /manualpeerlist:"uk.pool.ntp.org,0x8 europe.pool.ntp.org,0x8" /syncfromflags:MANUAL /reliable:yes /update). But I'm not sure if they're doing the same thing. Even if they did, I'm not sure which address I use. If I look at my secondary DC, it has an NtpServer entry of time.windows.com,0x9. The PDCe had it as well, until I did the reset that the article recommended; now it does not have an NtpServer entry.
So which method is the right one to use, and what address do I use? Does it matter if I'm running Server 2008 R2?
Event ID 12: Time Provider NtpClient: This machine is configured to use the domain hierarchy to determine its time source, but it is the AD PDC emulator for the domain at the root of the forest, so there is no machine above it in the domain hierarchy to use as a time source. It is recommended that you either configure a reliable time service in the root domain, or manually configure the AD PDC to synchronize with an external time source. Otherwise, this machine will function as the authoritative time source in the domain hierarchy. If an external time source is not configured or used for this computer, you may choose to disable the NtpClient.
Event ID 36: The time service has not synchronized the system time for 86400 seconds because none of the time service providers provided a usable time stamp. The time service will not update the local system time until it is able to synchronize with a time source. If the local system is configured to act as a time server for clients, it will stop advertising as a time source to clients. The time service will continue to retry and sync time with its time sources. Check system event log for other W32time events for more details. Run 'w32tm /resync' to force an instant time synchronization.
Event ID 144: The time service has stopped advertising as a good time source.
Event ID 131: NtpClient was unable to set a domain peer to use as a time source because of DNS resolution error on ''. NtpClient will try again in 3473457 minutes and double the reattempt interval thereafter. The error was: The requested name is valid, but no data of the requested type was found. (0x80072AFC).
Event ID 24: Time Provider NtpClient: No valid response has been received from domain controller DC-DNS.domain.org [this is our primary DC] after 8 attempts to contact it. This domain controller will be discarded as a time source and NtpClient will attempt to discover a new domain controller from which to synchronize. The error was: The peer is unreachable.
Event ID 142: The time service has stopped advertising as a time source because the local clock is not synchronized.
Event ID 50: The time service detected a time difference of greater than 5000 milliseconds for 900 seconds. The time difference might be caused by synchronization with low-accuracy time sources or by suboptimal network conditions. The time service is no longer synchronized and cannot provide the time to other clients or update the system clock. When a valid time stamp is received from a time service provider, the time service will correct itself.
Event ID 129: NtpClient was unable to set a domain peer to use as a time source because of discovery error. NtpClient will try again in 3145779 minutes and double the reattempt interval thereafter. The error was: The entry is not found. (0x800706E1)
I had an issue with a small client where the only DC was running as a VM. The clock would drift slow by seconds per day; over weeks or months it could be out by 20 minutes.
Following the instructions found here: http://technet.microsoft.com/en-us/library/cc794937(v=ws.10).aspx I used w32tm /stripchart /computer:time.windows.com /samples:5 /dataonly to determine how far out the clock was with the time.windows.com server (you can use any ntp server you like):
Tracking time.windows.com [64.4.10.33].
Collecting 5 samples.
The current time is 23/06/2013 8:12:34 AM (local time).
08:12:34, -53.2859637s
08:12:37, -53.4214102s
08:12:39, -53.3859342s
08:12:41, -53.2913859s
08:12:43, -53.2440682s
I then used w32tm /config /manualpeerlist:time.windows.com /syncfromflags:manual /reliable:yes /update to tell the server to use time.windows.com as its external time source:
The command completed successfully.
I then used w32tm /resync to force it to re-sync with time.windows.com now:
Sending resync command to local computer...
The command completed successfully.
I then used the first command again to confirm that the difference was near enough to 0 seconds:
Tracking time.windows.com [64.4.10.33].
Collecting 5 samples.
The current time is 23/06/2013 8:13:54 AM (local time).
08:13:54, -00.1657880s
08:13:56, +00.0059062s
08:13:59, -00.0088913s
08:14:01, +00.0030319s
08:14:03, +00.0063458s
Please note that the information was for an environment with a single DC. If you have more than 1 DC, you need to perform the above steps on the DC which holds the PDC Emulator FSMO role.
Hope this helps someone.
The Forest root PDC Emulator (ONLY!) may sync externally. http://technet.microsoft.com/en-us/library/cc794937(v=ws.10).aspx All other Clients, Servers, and DCs should use NT5DS. POOL.NTP.ORG is a good choice.
On all other DCs use:
net stop w32Time
w32tm /unregister
w32tm /register
net start w32time
to reset the time service to use NT5DS as stated in http://technet.microsoft.com/en-us/library/cc738995(v=ws.10).aspx.
If clients or other servers are still having problems, use the same technique via GPO, for example, as admin rights are required.
You also need to be very wary of VM domain controllers, as they may or may not keep accurate time depending on the host's CPU utilization! Differences of several minutes are common, and deadly as far as Kerberos is concerned.
One thing you need to clear up - are these DCs VMs running in Hyper-V or are they physical servers? If they're running in Hyper-V, there's a setting which passes the VM host time to the VMs. All you have to do is turn that sync off, then use the w32tm command to set your DCs to an NTP server like time.windows.com as indicated above.
I don't recall the setting off the top of my head, but I had this problem as well...5 DCs all showing different times.
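For what it's worth, the setting being referred to is the "Time synchronization" item under Integration Services in the VM's settings in Hyper-V Manager. On newer hosts with the Hyper-V PowerShell module it can also be turned off from the host, for example (the VM name is a placeholder):

Disable-VMIntegrationService -VMName "DC01" -Name "Time Synchronization"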

Intermittent connection timeouts to Solr server using SolrNet

I have a production webserver hosting a search, and another machine which hosts the Solr search server (on a subnet in the same room, so no network problems). All is fine >90% of the time, but I consistently get a small number of "The operation has timed out" errors.
I've increased the timeout in the SolrNet init to 30 seconds (!)
SolrNet.Startup.Init<SolrDataObject>(
    new SolrNet.Impl.SolrConnection(
        System.Configuration.ConfigurationManager.AppSettings["URL"]
    ) { Timeout = 30000 }
);
...but all that happened is I started getting this message instead of the "Unable to connect to the remote server" error I was seeing before. It seems to have made no difference to the number of timeout errors.
I can see nothing in any log (believe me I've looked!) and clearly my configuration is correct because it works most of the time. Anyone any ideas how I can find more information on this problem?
EDIT:
I have now increased the number of HttpRequest connections from 2 to 'a large number' (I see up to 10 connections) - but this has had no discernible effect on this problem.
The firewall is set to allow ANY connections between the two machines.
We've also checked the hardware with our server host and there are no problems on the connections, according to them.
EDIT 2:
We're still seeing this issue.
We're now logging the timeouts and they're mostly just over 30s - which is the SolrNet layer's timeout; some are 20s, though - which is the Tomcat default timeout period - which suggests it's something in the actual connection between the machines.
Not sure where to go from here, though - they're on a VLAN and we're specifically using the VLAN address - response time from pings is ALWAYS <1ms.
Without more information, I can only guess a few possible reasons:
You're fetching tons of documents per query, and it times out while transferring data.
You're hitting the ServicePoint.ConnectionLimit. If so, just increase this value. See also How can I programmatically remove the 2 connection limit in WebClient
You have some very facet-heavy requests or misusing Solr (e.g. not using filter queries). Check the qtime in the response. See the Solr performance wiki for more details.
Try setting this in .NET:
ServicePointManager.Expect100Continue = false;
or this
ServicePointManager.SetTcpKeepAlive(true, 200000, 200000); - this sends TCP keep-alive probes so idle connections to the server stay open.
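Put together, a minimal sketch of those tweaks, applied once at application start-up (the connection limit of 100 is an arbitrary example value, and the class/method names are just placeholders):

using System.Net;

public static class SolrHttpTuning
{
    // Call once at application start-up, e.g. from Application_Start in Global.asax.
    public static void Apply()
    {
        ServicePointManager.DefaultConnectionLimit = 100;           // raise the default 2-connections-per-host limit
        ServicePointManager.Expect100Continue = false;              // skip the Expect: 100-continue handshake
        ServicePointManager.SetTcpKeepAlive(true, 200000, 200000);  // send TCP keep-alive probes on idle connections
    }
}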

Scheduling RichCopy Jobs

Anyone use the timer feature of RichCopy? I have a job that works fine when I start it manually. However, when I schedule the job and click run, the app appears to wait for the scheduled time to elapse yet never fires. Interestingly enough, when I stop the job the copy starts.
Anyone have any experience with using RichCopy timer?
IanB
Try creating a batch file with command-line options. Then use the Windows Task Scheduler to launch the batch.
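For example, assuming the copy job has been wrapped in a batch file, something along these lines registers it with Task Scheduler (the task name, path and schedule are placeholders):

schtasks /Create /TN "Nightly RichCopy" /TR "C:\Scripts\richcopy-job.bat" /SC DAILY /ST 02:00 /RU SYSTEM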
OMBG (Bill Gates)! You need to read up on security policy and the respect it has to pay to a hierarchy of upstream objects and credentials. Well, that's the MS answer and attitude...
The reality is that if you are working with server OSes you need to understand their security & policy frameworks, and how to debug them :). If your process loses the necessary file permissions or rights (two different things) you should ask: "Hot damn, why didn't I fix that in the config/setup?" People who blast the vendor/project (or even ####&$! MS) are just blinding themselves to the solution.
In most cases this kind of issue is due to Windows AD removing the rights of a local administrator user to run a scheduled task. It is a common security setting in corporate networks (implemented with glee by domain admins to upset developers), though it is really a default setting these days. It happens because the machine updates against an upstream policy (after you've scheduled a task) and decides that all of a sudden it won't trust you to run it (even though it previously let you set it up). In a perfect world it wouldn't let you set it up in the first place, but that isn't the way policy applies in Windows... (####&$! MS). LOL
Wow it only took 5 months to get an answer! (but here they are for the next person at least!)
