Nagios Core 3 custom notification exception

I'm hoping to avoid a hacked-together mishmash to achieve something. I know it can be done with a mishmash, but let's see if I'm missing a SIMPLE, easy way. This is Nagios Core 3.
I have a service. That service is checked 24x7x365. Notifications are sent 24x7x365, on WARNING and also on CRITICAL.
That is good--that is what I want.
However...now I want one single exception to that notification setup. NOTE: I do not want an exception to the monitoring setup--I want the console to always show the correct status, 24x7. I just want to make one exception for the notification (via email) on this service.
Here is the exception:
IF service state is WARN AND time of day is between 0300 and 0600, do NOT notify.
That's it. If it's CRITICAL, email-notify 24x7 (as it already does). If it's not between 3 and 6 a.m., notify regardless of WARN vs. CRIT (as it already does). The only exception is WARNING and 3-6 a.m.
Background: This is because we have maintenance that occurs every night between 3 and 6, which we've customized to produce a WARNING (not CRITICAL). I want notifications any time outside of this (admin may have accidentally launched maint in middle of day), and I want CRITICAL any time. I don't want to simply skip CHECKS during that time because I do want the console to be correct (a big bunch of yellows 0300-0600).
So, anyway, seems like I can kludge together a bunch of constructs but does anybody have a simple way to define this one "boolean AND" condition to the notification (only) schedule?

This is what scheduled downtime is for. If you create a scheduled downtime window, alerts will be suppressed during that timeframe.
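Nagios Core 3 has no built-in recurring downtime, so this is usually automated with a nightly cron entry that writes a SCHEDULE_SVC_DOWNTIME external command into the command file. A rough sketch, run at 03:00, where the host name, service description, and command-file path are placeholders:

#!/usr/bin/env python
# Hypothetical cron helper: schedule a fixed 3-hour downtime window for one service
# by writing a SCHEDULE_SVC_DOWNTIME external command to the Nagios command file.
import time

CMD_FILE = '/usr/local/nagios/var/rw/nagios.cmd'   # your command_file setting
DURATION = 3 * 3600                                 # 03:00 -> 06:00

now = int(time.time())
cmd = '[%d] SCHEDULE_SVC_DOWNTIME;myhost;My Service;%d;%d;1;0;%d;cron;nightly maintenance\n' % (
    now, now, now + DURATION, DURATION)
with open(CMD_FILE, 'w') as f:
    f.write(cmd)

Keep in mind that downtime suppresses CRITICAL notifications for that window too, which is broader than the requirement stated above.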
If that's not an option, then you need two different contacts for this service: one that notifies 24x7 and only on CRITICAL, and another that notifies 24x7 (minus 03:00-06:00) and only receives WARNING notifications. Have them both point to the same contact email address.
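A minimal object-configuration sketch of that split; the timeperiod and contact names, email address, and notification commands are placeholders from the stock sample config, and both contacts would then be attached to the service:

define timeperiod{
        timeperiod_name  24x7_sans_maint
        alias            24x7 except the 03:00-06:00 maintenance window
        sunday           00:00-03:00,06:00-24:00
        monday           00:00-03:00,06:00-24:00
        tuesday          00:00-03:00,06:00-24:00
        wednesday        00:00-03:00,06:00-24:00
        thursday         00:00-03:00,06:00-24:00
        friday           00:00-03:00,06:00-24:00
        saturday         00:00-03:00,06:00-24:00
        }

define contact{
        contact_name                    ops-critical        ; gets CRITICAL (and recoveries) 24x7
        service_notification_period     24x7
        service_notification_options    c,r
        service_notification_commands   notify-service-by-email
        host_notification_period        24x7
        host_notification_options       d,u,r
        host_notification_commands      notify-host-by-email
        email                           ops@example.com
        }

define contact{
        contact_name                    ops-warning         ; gets WARNING only, outside 03:00-06:00
        service_notification_period     24x7_sans_maint
        service_notification_options    w
        service_notification_commands   notify-service-by-email
        host_notification_period        24x7
        host_notification_options       d,u,r
        host_notification_commands      notify-host-by-email
        email                           ops@example.com
        }

Nagios evaluates the notification period per contact, so between 03:00 and 06:00 the WARNING-only contact simply falls outside its period while CRITICAL still reaches the other contact.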


NDB query().iter() of 1000<n<1500 entities is wigging out

I have a script that, using Remote API, iterates through all entities for a few models. Let's say two models, called FooModel with about 200 entities, and BarModel with about 1200 entities. Each has 15 StringPropertys.
for model in [FooModel, BarModel]:
    print 'Downloading {}'.format(model.__name__)
    new_items_iter = model.query().iter()
    new_items = [i.to_dict() for i in new_items_iter]
    print new_items
When I run this in my console, it hangs for a while after printing 'Downloading BarModel'. It hangs until I hit ctrl+C, at which point it prints the downloaded list of items.
When this is run in a Jenkins job, there's no one to press ctrl+C, so it just runs continuously (last night it ran for 6 hours before something, presumably Jenkins, killed it). Datastore activity logs reveal that the datastore was taking 5.5 API calls per second for the entire 6 hours, racking up a few dollars in GAE usage charges in the meantime.
Why is this happening? What's with the weird behavior of ctrl+C? Why is the iterator not finishing?
This is a known issue currently being tracked on the Google App Engine public issue tracker under Issue 12908. The issue was forwarded to the engineering team and progress on this issue will be discussed on said thread. Should this be affecting you, please star the issue to receive updates.
In short, the issue appears to be with the remote_api script. When querying entities of a given kind, it will hang when fetching 1001 + batch_size entities when the batch_size is specified. This does not happen in production outside of the remote_api.
Possible workarounds
Using the remote_api
One could limit the number of entities fetched per script execution using the limit argument for queries. This may be somewhat tedious but the script could simply be executed repeatedly from another script to essentially have the same effect.
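A variation on that idea is to fetch in small, cursor-resumed pages so that no single call asks for more than a few hundred entities. A rough sketch using ndb's fetch_page; the page size and the final print are just illustrative:

# Fetch in small, cursor-resumed pages so no single call requests more than PAGE entities.
PAGE = 500   # arbitrary choice, well under the problematic boundary

def fetch_all(model):
    items = []
    cursor = None
    more = True
    while more:
        batch, cursor, more = model.query().fetch_page(PAGE, start_cursor=cursor)
        items.extend(entity.to_dict() for entity in batch)
    return items

print len(fetch_all(BarModel))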
Using admin URLs
For repeated operations, it may be worthwhile to build a web UI accessible only to admins. This can be done with the help of the users module as shown here. This is not really practical for a one-time task but far more robust for regular maintenance tasks. As this does not use the remote_api at all, one would not encounter this bug.
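A bare-bones sketch of such an admin-only handler (webapp2); the route, handler name, and export logic are made up for illustration:

import webapp2
from google.appengine.api import users
# BarModel is the model from the question; import it from wherever it is defined.

class ExportHandler(webapp2.RequestHandler):
    def get(self):
        # Reject anyone who is not an application admin.
        if not users.is_current_user_admin():
            self.abort(403)
        rows = [entity.to_dict() for entity in BarModel.query()]
        self.response.write('Exported %d entities' % len(rows))

app = webapp2.WSGIApplication([('/admin/export', ExportHandler)])

Marking the route with login: admin in app.yaml gives the same restriction at the routing layer.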

Salesforce Schedulable not working

I have several HTTP callouts that are in a schedulable class set to run every hour or so. After I deployed the app on the AppExchange and had a Salesforce user download it to test, it seems the jobs are not executing.
I can see the jobs are being scheduled to run accordingly; however, the database never seems to change. Is there any reason this could be happening, or is there a good chance the flaw lies in my code?
I was thinking that it could be permissions; however, I am not sure (it's the first app I am deploying).
Check if your end user's organisation has added your endpoint to "remote site settings" in Setup. By endpoint I mean the address that's being called (or just its domain).
If the class is scheduled properly (which I believe would be a manual action, not just something that magically happens after installation... unless you've used a post-install script?), you could also examine Setup -> Apex Jobs and check if there are any errors. If I'm right, there will be an error about a callout not being allowed due to remote site settings. If not, there's still a chance you'll see something that will make you think. For example, a batch job that executed successfully but had 0 iterations -> problem?
Last but not least, you can always try the debug logs :) Enable them in Setup (or open the developer console), fire the scheduled class's execute() manually and observe the results. How to fire it manually? Something like this pasted into "execute anonymous":
MySchedulableClass sched = new MySchedulableClass();
sched.execute(null);
Or - since you know what's inside the scheduled class - simply experiment.
Please note that if the updates you are performing somehow violate, for example, validation rules your client has, then yes, the database will be unchanged. But in that case you should still be able to see the failures in Setup -> Apex Jobs.

Receive an alert when a job is *not* running

I know how to set up a job to alert when it's running.
But I'm writing a job which is meant to run many times a day, and I don't want to be bombarded by emails, but rather I'd like a solution where I get an alert when the job hasn't been executed for X minutes.
This can be achieved by setting the job to alert on execution, and then setting up some process which checks for these alerts and warns when no such alert has been seen for X minutes.
I'm wondering if anyone's already implemented such a thing (or equivalent).
Supporting multiple jobs with different X values would be great.
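What's being described is essentially a dead man's switch: each job touches a heartbeat on success, and a separate checker alerts when a heartbeat is older than that job's X. A minimal sketch, with the job names, intervals, heartbeat directory, and mail addresses all made up:

# Dead-man's-switch checker, run from cron every few minutes. Each monitored job is
# expected to touch its heartbeat file (same name as the job) on every successful run.
import os
import time
import smtplib
from email.mime.text import MIMEText

JOBS = {'nightly-sync': 30, 'billing-export': 120}   # job name -> max silence, minutes
HEARTBEAT_DIR = '/var/run/job-heartbeats'

def overdue_jobs():
    now = time.time()
    for job, max_minutes in JOBS.items():
        path = os.path.join(HEARTBEAT_DIR, job)
        age = now - os.path.getmtime(path) if os.path.exists(path) else float('inf')
        if age > max_minutes * 60:
            yield job, age

def send_alert(job, age):
    msg = MIMEText('%s has not reported success for %.0f minutes' % (job, age / 60))
    msg['Subject'] = 'Job overdue: %s' % job
    msg['From'] = 'watchdog@example.com'
    msg['To'] = 'ops@example.com'
    server = smtplib.SMTP('localhost')
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()

if __name__ == '__main__':
    for job, age in overdue_jobs():
        send_alert(job, age)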
The danger of this approach is this: suppose you set this up. One day you receive no emails. What does this mean?
It could mean
the supposed-to-be-running job is running successfully (and silently), and so the absence-of-running monitor job has nothing to say
or alternatively
the supposed-to-be-running job is NOT running successfully, but the absence-of-running monitor job has ALSO failed
or even
your server has caught fire, and can't send any emails even if it wants to
Don't seek to avoid receiving success messages - instead, devise a strategy for coping with them, because the only way to know that a job is running successfully is to get a message which says precisely that.

SQL Server Service Broker Service Disappearing (Automatically Deleted)?

I've implemented a messaging system over SQL Server Service Broker. It is working great, with the sole exception that every once in a while (maybe once per week per server) my initiator service just vanishes without a trace. The corresponding queue is still there, but the service is missing.
Obviously this causes problems in my system. It's a simple matter to recreate the service by hand, but I'm confused as to what might cause this behavior. I understand that automatic poison message handling causes queues to be disabled, but I don't see anything that indicates services can be disabled or deleted automatically.
When this happens, I usually have a large backlog of messages in multiple application queues, but nothing extreme. Total message backlog is around 200,000.
Does anyone know what might be happening here?
You must have a bug of some sort that issues a DROP SERVICE statement. That is the only way a service gets deleted.
Check the default trace: the DROP statement gets traced and saved into it, so you can track down the application, user, and statement that issued the DROP. Check sys.traces to find the location of the default trace, then open the .TRC file in Profiler.
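For reference, the default trace file's location can be read straight from that catalog view:

SELECT path FROM sys.traces WHERE is_default = 1;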

Best practice for updating a Silverlight deployment that is actively being used

I am currently running an SL3 project where we are in a highly iterative development mode with about 25 active test customers. I am making small changes at a clip of about 4 new builds per day. It is important to know that this application is a mission-critical line-of-business tool for these 25 people: it is the tool they use all day to do their work, so they are using it constantly and often launch their browser and the app in the morning and never close it until the end of the day.
The challenge is that when I make an update to the application I have no clean way to notify the users. In most cases this is OK, as it is rare that I introduce a data contract change or something that would be a classic 'breaking' change to the app/service. Users keep plugging along and will get the change the next time they refresh.
Right now we have resorted to emailing everyone and telling them to force refresh or close the browser and log back in.
Surely there is a better way...
Right now my train of thought is to have a method on the server that compares client xap versions and determines whether the client being used is the most up to date; if it isn't, I will notify the user and make them update.
What have you done to solve this problem?
One way of doing it is to use a push mechanism (I used Kaazing WebSocket Gateway but any would do). When a new version of the XAP is released, a message (either entered into the system manually by an admin or triggered automatically by a XAP file change event) would be sent to all the clients. In the simplest scenario some notification would be shown to the user (telling him that a new version has been released and the application needs to refresh) and then the app would refresh (by simply reloading the page), saving the user's state if necessary.
If I were doing this, I would just keep it simple: a configuration value in web.config and a corresponding service method that simply returns that value (the value itself could be anything, but a counter is probably wise). Then you could have your Silverlight app poll that service method at regular intervals. Whenever the value changes (which you would do manually when you deploy a new version), just pop up a dialog telling the user to refresh the browser or log in/out. This way you don't have to force them to refresh every time. If you go with the idea of comparing xap file versions, they will always be required to refresh, even for non-breaking changes.
If you want to take it further you could come up with some sort of mechanism to distinguish between different severity levels. For instance, if the new config value would contain the string "update_forced", you could force the users to reload the app by logging them out automatically (a little harsh, perhaps). If it contains the string "update_recommended", just show a little icon at the top right corner saying that there is a new version and that they should upgrade in their own time.
Granted, this was targeted at Silverlight 3, but with the PollingDuplex client and such in the newer versions of Silverlight, you could publish an "Update Now" bit to the clients, and build a mechanism in the client to alert the user that there is an update that is now out... that they should update it shortly, etc. You may even be able, through serialization and such, to save the state that they are in when they close the app to reload it.
We've done stuff similar with a LOB app that we built, so that as users are changing things, the rest of the userbase sees those changes immediately. Next up will be putting the flags in to change authorization and upgrades "on the fly" if you will.
