Nagios Core Summary Macro has Wrong Count

I believe the following summary macros are not accounting for Passive Services:
$TOTALSERVICESCRITICALUNHANDLED$ (This is the one with which I see the problem directly)
And I assume the following two have the same issue:
$TOTALSERVICESWARNINGUNHANDLED$
$TOTALSERVICESUNKNOWNUNHANDLED$
Passive services that are NOT in downtime and NOT acknowledged rightfully show up in the Unhandled Services page of Nagios Core.
But a script I'm using prints the value of $TOTALSERVICESCRITICALUNHANDLED$, and that value does not account for passive services that are non-downtime, non-acknowledged, and in a CRITICAL state.
The documentation for this macro indicates that the service must have 'checks enabled', which probably does not account for passive checks:
"
This macro reflects the total number of services that are currently in a CRITICAL state that are not currently being "handled". Unhandled services problems are those that are not acknowledged, are not currently in scheduled downtime, and for which checks are currently enabled.
"
My setup:
I have a command that is executed by a regularly scheduled service. The command passes the value of macro $TOTALSERVICESCRITICALUNHANDLED$ to a script.
The script just echoes the value of that macro.
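For reference, a minimal sketch of that wiring as a Nagios command definition (the command name and script path are illustrative, not my exact config):

define command {
    command_name  report_unhandled_criticals
    command_line  /usr/local/nagios/libexec/echo_macro.sh "$TOTALSERVICESCRITICALUNHANDLED$"
}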
Test:
All services are in downtime except my passive service, which has passive checks enabled and is in a CRITICAL state. The script reports "0" for the number of unhandled critical alerts (this is incorrect!).
If I enable active checks on the passive service, the script then reports "1".
Nagios Core Version 4.3.2
Can you please advise whether this is a bug that was addressed in a later version, or whether there is any workaround for me?
I have seen this related issue, which was fixed in 4.2.2, but it is a different issue: viewtopic.php?t=39957

I ended up making this change to the source code. I'm assuming that if a service is in downtime or is acknowledged, it is already excluded from the total alarm count, so the check for checks_enabled is redundant and incorrectly throws out passive services (it seems checks_enabled is a flag that only represents ACTIVE checks).
common/macros.c starting at line 1216:
Comment out the three instances (two lines each) where it checks:
if(temp_service->checks_enabled == FALSE)
    problem = FALSE;
(And then rebuild Nagios Core)
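For context, this is roughly the shape of one of those three counting loops with the change applied; a paraphrased sketch from memory of the 4.3.x source, not an exact diff:

/* One of the three unhandled-problem counts in common/macros.c (sketch). */
for(temp_service = service_list; temp_service != NULL; temp_service = temp_service->next) {
    if(temp_service->current_state == STATE_CRITICAL) {
        problem = TRUE;
        /* Removed: this test throws out passive-only services, since
           checks_enabled only reflects ACTIVE checks.
        if(temp_service->checks_enabled == FALSE)
            problem = FALSE;
        */
        if(temp_service->problem_has_been_acknowledged == TRUE)
            problem = FALSE;
        if(temp_service->scheduled_downtime_depth > 0)
            problem = FALSE;
        if(problem == TRUE)
            services_critical_unhandled++;
    }
}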
The only way I can see this coming back to bite me is if there's a case where active services had their active checks disabled and also were not in downtime or acknowledged.

Related

Performing the synchronization with ExecuteOfflineCommand more effectively

I'm wondering if there is a way to recognize that the OfflineCommand is being executed, or some internal flag to indicate that the command has been passed, or to mark it as executed successfully. With an unstable internet connection I have trouble recognizing whether a command has been passed or not. I keep retrieving the records from the database and comparing them each and every time to see whether a command has been passed, but due to the flow of my application I'm finding it very difficult to avoid duplicates. Is there any automatic process to make sure commands are executed, or something else?
Second question: I can use a UITimer to check isOffline() on the forms, to make sure the internet is connected or not. Is there something equivalent on the server page, or where the queries are written, to see whether the internet is disconnected? When control moves to the queries and the internet is disconnected, the dialog opened from the form page freezes indefinitely and never ends; I have to close and re-open the app to continue the synchronization process. At the same time, I cannot set a timeout on the dialog because I'm not sure how long the synchronization process will take. Please advise.
I'm extending the same topic, but I have created a new issue to give more clarity on my questions:
executeOfflineCommand skips a command while executing from storage on Android
There is no way to know if a connection will stay stable, as that requires knowledge of the future. You can work the way transaction services do, where the server side processes an offline command as a transaction using the 2-phase-commit approach.
In this approach you have an algorithm similar to this:
1. Client sends the command to the server
2. Server returns a special unique ID for the command
3. Client asks the server to perform the command with that unique ID
4. Server acknowledges that the command was performed
If the first 2 stages didn't complete, you just do them again. The worst thing that could happen is some orphaned commands on the server.
If the 3rd step didn't complete, you just do it again. The server knows whether it already processed the command and will simply acknowledge it if it was already processed.
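A minimal client-side sketch of that flow in plain Java (the /register and /execute endpoints, the base URL, and the fixed retry delay are illustrative assumptions, not a Codename One API):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class TwoPhaseClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String base;

    public TwoPhaseClient(String base) { this.base = base; }

    // Phase 1: send the command; the server stores it and returns a unique id.
    private String register(String payload) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(base + "/register"))
                .POST(HttpRequest.BodyPublishers.ofString(payload)).build();
        return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    // Phase 2: ask the server to execute the stored command by id. The server
    // records completed ids, so re-sending a finished id is a harmless no-op.
    private void execute(String id) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(base + "/execute"))
                .POST(HttpRequest.BodyPublishers.ofString(id)).build();
        http.send(req, HttpResponse.BodyHandlers.discarding());
    }

    // Retry each phase until it succeeds. The worst case is an orphaned,
    // never-executed command record on the server (from phase 1 retries).
    public void sendReliably(String payload) throws InterruptedException {
        String id = null;
        while (id == null) {
            try { id = register(payload); } catch (Exception e) { Thread.sleep(1000); }
        }
        while (true) {
            try { execute(id); return; } catch (Exception e) { Thread.sleep(1000); }
        }
    }
}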

Nagios conditional checks

Currently I am monitoring my target Windows hosts for a bunch of services (CPU, memory, disks, SSL certs, HTTP, etc.). I'm using NSClient as the agent that the Nagios server talks to.
My problem is that I deploy to those hosts three times every 24 hours. The deployment process requires the hosts to reboot. Whenever my hosts reboot I get nagios alerts for each service. This means a large volume of alerts, which makes it difficult to identify real issues.
Ideally I'd like to do this:
If the host is down, don't send any alerts for the rest of the services
If the host is rebooting, that means NSClient is not accessible. I want to receive only one alert (e.g. "CPU is not accessible") and mute everything else for a few minutes, so the host can finish booting and NSClient becomes available.
Implementing this would leave me with one email per host for each deployment. That is much better than everything turning red and me getting flooded with alerts that aren't worth checking (since they're only sent because the Nagios agent, NSClient, is not available during the reboot).
Got to love using a Windows stack...
There are several ways to handle this.
If your deploys happen at the same time every day:
1. you could modify your active time period to exclude those times (or)
2. schedule downtime for your host via the Nagios GUI
If your deployments happen at different/random times, things become a bit harder to work around:
1. when NRPE or NSClient is not reachable, Nagios will often throw an 'UNKNOWN' alert for the check. If you remove the 'u' option from the following entries:
host_notification_options [d,u,r,f,s,n]
service_notification_options [w,u,c,r,f,s,n]
That would prevent the 'UNKNOWN's from sending notifications. (or)
2. dynamically modify active checking of the impacted checks by 'turning them off' before you start the deployment, and then 'turning them back on' after the deployment. This can be automated using the Nagios 'external command file'; see the sketch below.
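For option 2, a sketch of what the deploy script would write into the external command file (the path comes from your command_file setting; host and service names are placeholders):

#!/bin/sh
# Disable the active check before the deploy, re-enable it afterwards.
CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
NOW=$(date +%s)
printf "[%s] DISABLE_SVC_CHECK;winhost01;CPU Load\n" "$NOW" > "$CMDFILE"
# ... run the deployment / reboot here ...
NOW=$(date +%s)
printf "[%s] ENABLE_SVC_CHECK;winhost01;CPU Load\n" "$NOW" > "$CMDFILE"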
Jim Black's answer would work, or, if you want to go even more in depth, you can define dependencies with service notification escalations, as described in the documentation below.
Escalating the alerts would mean that you could define: CPU/SSL etc. check fails -> check whether the host is down -> notify/don't notify.
Nagios Service Escalation (3.0)
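For the dependency side of that idea, a minimal sketch of a service dependency object (host and service names are placeholders): it suppresses the CPU notification whenever the agent check itself is CRITICAL or UNKNOWN.

define servicedependency {
    host_name                       winhost01
    service_description             NSClient Agent
    dependent_host_name             winhost01
    dependent_service_description   CPU Load
    notification_failure_criteria   c,u
}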

Salesforce Schedulable not working

I have several HTTP callouts in a Schedulable class, set to run every hour or so. After I deployed the app on the AppExchange and had a Salesforce user download it to test, it seems the jobs are not executing.
I can see the jobs are being scheduled to run accordingly, yet the database never seems to change. Is there any reason this could be happening, or is there a good chance the flaw lies in my code?
I was thinking it could be permissions, but I'm not sure (it's the first app I am deploying).
Check whether your end user's organisation has added your endpoint to "Remote Site Settings" in Setup. By endpoint I mean the address that's being called (or just the domain).
If the class is scheduled properly (which I believe would be a manual action, not something that magically happens after installation... unless you've used a post-install script?), you could also examine Setup -> Apex Jobs and check whether there are any errors. If I'm right, there will be an error about a callout not being allowed due to remote site settings. If not, there's still a chance you'll see something that makes you think; for example, a batch job that executed successfully but had 0 iterations -> problem?
Last but not least, you can always try the debug logs :) Enable them in Setup (or open the developer console), fire the scheduled class's execute() manually, and observe the results. How to fire it manually? Something like this, pasted into "execute anonymous":
MySchedulableClass sched = new MySchedulableClass();
sched.execute(null);
Or - since you know what's inside the scheduled class - simply experiment.
Please note that if the updates you're performing somehow violate, for example, validation rules your client has, then yes, the database will be unchanged. But in that case you should still be able to see the failures in Setup -> Apex Jobs.
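As a quick sanity check that the job really is registered, you can also query CronTrigger from "execute anonymous" (standard object and fields, nothing app-specific):

// List scheduled jobs, their state, and when they fire next.
for (CronTrigger ct : [SELECT CronJobDetail.Name, State, NextFireTime, TimesTriggered
                       FROM CronTrigger]) {
    System.debug(ct.CronJobDetail.Name + ' -> ' + ct.State +
                 ', next fire: ' + ct.NextFireTime);
}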

Nagios Core 3 custom notification exception

I'm hoping to avoid a hacked-together mishmash to achieve something. I know it can be done with a mishmash, but let's see if I'm missing a SIMPLE, easy way. This is Nagios Core 3.
I have a service. That service is checked 24x7x365. Notifications are sent 24x7x365, on WARNING and also on CRITICAL.
That is good--that is what I want.
However...now I want one single exception to that notification setup. NOTE: I do not want an exception to the monitoring setup--I want the console to always show the correct status, 24x7. I just want to make one exception for the notification (via email) on this service.
Here is the exception:
IF service state is WARN AND time of day is between 0300 and 0600, do NOT notify.
That's it. If it's CRITICAL, notify by email 24x7 (as it already does). If it's not between 3 and 6 a.m., notify regardless of WARNING vs. CRITICAL (as it already does). The only exception is WARNING between 3 and 6 a.m.
Background: this is because we have maintenance that occurs every night between 3 and 6, which we've customized to produce a WARNING (not a CRITICAL). I want notifications any time outside of this window (an admin may have accidentally launched maintenance in the middle of the day), and I want CRITICAL notifications at any time. I don't want to simply skip CHECKS during that window, because I do want the console to be correct (a big bunch of yellows 03:00-06:00).
So, anyway, it seems like I can kludge together a bunch of constructs, but does anybody have a simple way to define this one "boolean AND" condition on the notification (only) schedule?
This is what scheduled downtime is for. If you create a scheduled downtime window, alerts will be suppressed during that timeframe.
If that's not an option, then you need two different contacts for this service: one that notifies 24x7 and only on CRITICAL, and another that notifies 24x7 (sans 3-6 a.m.) and only receives WARNING notifications. Have them both point to the same contact email address.
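A minimal sketch of the two-contact setup (contact names, the email address, and the timeperiod are illustrative):

define timeperiod {
    timeperiod_name 24x7-sans-0300-0600
    alias           24x7 except 03:00-06:00
    sunday          00:00-03:00,06:00-24:00
    monday          00:00-03:00,06:00-24:00
    tuesday         00:00-03:00,06:00-24:00
    wednesday       00:00-03:00,06:00-24:00
    thursday        00:00-03:00,06:00-24:00
    friday          00:00-03:00,06:00-24:00
    saturday        00:00-03:00,06:00-24:00
}

# Receives only CRITICAL (and recovery) notifications, around the clock.
define contact {
    contact_name                    oncall-critical
    email                           oncall@example.com
    host_notification_period        24x7
    host_notification_options       n
    host_notification_commands      notify-host-by-email
    service_notification_period     24x7
    service_notification_options    c,r
    service_notification_commands   notify-service-by-email
}

# Receives only WARNING notifications, never between 03:00 and 06:00.
define contact {
    contact_name                    oncall-warning
    email                           oncall@example.com
    host_notification_period        24x7
    host_notification_options       n
    host_notification_commands      notify-host-by-email
    service_notification_period     24x7-sans-0300-0600
    service_notification_options    w
    service_notification_commands   notify-service-by-email
}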

SQL Server Service Broker Service Disappearing (Automatically Deleted)?

I've implemented a messaging system over SQL Server Service Broker. It is working great, with the sole exception that every once in a while (maybe once per week per server) my initiator service just vanishes without a trace. The corresponding queue is still there, but the service is missing.
Obviously this causes problems in my system. It's a simple matter to recreate the service by hand, but I'm confused as to what might cause this behavior. I understand that automatic poison message handling causes queues to be disabled, but I don't see anything that indicates services can be disabled or deleted automatically.
When this happens, I usually have a large backlog of messages in multiple application queues, but nothing extreme. Total message backlog is around 200,000.
Does anyone know what might be happening here?
You must have a bug of some sort that issues a DROP SERVICE statement. That is the only way a service gets deleted.
Check the default trace; the DROP statement gets traced and saved into it, so you can track down the application/user/statement that issued the DROP. Check sys.traces to find the location of the default trace, then open the .trc file in Profiler.
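A sketch of pulling those drop events straight out of the default trace with fn_trace_gettable (EventClass 47 is Object:Deleted):

-- Locate the default trace file and list recent object deletions.
DECLARE @path nvarchar(260);
SELECT @path = path FROM sys.traces WHERE is_default = 1;

SELECT StartTime, HostName, ApplicationName, LoginName,
       DatabaseName, ObjectName
FROM fn_trace_gettable(@path, DEFAULT)
WHERE EventClass = 47   -- Object:Deleted
ORDER BY StartTime DESC;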
