Nagios: check multiple services simultaneously?

I've just started using Nagios to monitor a group of broadcast transmitters. Each transmitter is defined as a host, and each aspect of the transmitter I wish to monitor (RF forward, RF reflected, power supply voltages, etc) is defined as a service. In doing so, I can get an alarm if any of these aspects are out of tolerance, and can use the performance data to graph each aspect (using pnp4nagios, in this case).
To check the transmitters' telemetry data, I wrote some scripts, one for each make/model of transmitter involved, to address its unique facilities. In keeping with the way I've seen other Nagios checks work, an argument to the script selects which aspect you want reported.
At first I was content with this; it worked like the more traditional uses of Nagios I'd encountered. But then I hit a snag.
Because each service check is scheduled individually, diagnosing an alarm condition can be tricky, since the various services aren't all being checked at the same time - and therefore the set of values I'm looking at is unlikely to be time-aligned. If all the service check values were from the same moment in time, it would be easier to detect correlations (since the set of values would essentially be a snapshot).
My first thought would be to deal with this by running a single instance of a single command, which would return values for multiple services. This would also seem far more efficient than opening as many connection instances as there are services to be checked. From a scripting perspective, this is easily done. But from a Nagios config perspective, I don't know how (or if?) you'd do that.
I know I could also divorce the data collection from the Nagios check, caching the telemetry values all at once periodically, and feeding Nagios values from the cache. But I don't want to introduce added delays if I can help it.
Thoughts?

There's nothing strange about this from a Nagios perspective, because what you're essentially doing is writing your own plugin, and plugins can be as general or specific as you want them to be.
When writing your own plugin, it's good to remember:
- Your script is responsible for all failures, so make sure you handle garbage responses, failed connections, and whatever other errors you predict may happen in the plugin itself, and exit with appropriate error levels.
- Since you may encounter errors you didn't expect, it probably makes sense to have the plugin write what it's doing to a log file, as well as what responses it got.
- The plugin must use exit codes to alert Nagios correctly, and if you want performance data, it needs to be given in the correct syntax (a rough sketch follows). See the development guidelines.
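As a rough illustration of those points, here is a minimal sketch of a plugin that polls a transmitter once and reports several metrics from that single snapshot. The read_telemetry function, the metric names, and the threshold are placeholders for whatever your existing scripts do, not a real API:

```python
#!/usr/bin/env python3
import sys

# Standard Nagios exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def read_telemetry(host):
    """Placeholder: open one connection to the transmitter and return all
    values at once, e.g. {"rf_forward": 9800, "rf_reflected": 45, "psu_volts": 48.1}."""
    raise NotImplementedError("replace with the real transmitter query")

def main():
    host = sys.argv[1]
    try:
        data = read_telemetry(host)
    except Exception as exc:
        # The plugin handles failed connections / garbage responses itself
        print(f"UNKNOWN - could not read telemetry: {exc}")
        sys.exit(UNKNOWN)

    problems = []
    if data.get("rf_reflected", 0) > 100:   # illustrative threshold only
        problems.append("reflected power high")

    # Performance data goes after the pipe: one label=value pair per metric
    perf = " ".join(f"'{name}'={value}" for name, value in data.items())
    if problems:
        print(f"CRITICAL - {', '.join(problems)} | {perf}")
        sys.exit(CRITICAL)
    print(f"OK - telemetry nominal | {perf}")
    sys.exit(OK)

if __name__ == "__main__":
    main()
```

Because every metric in the perfdata comes from one read, the graphs and thresholds are all time-aligned, which was the original concern.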
I'm considering submitting the service data passively. It would solve all the problems I mentioned, but it would create a few minor new ones: now there are external processes to keep running, and it's a little outside the mainstream way of doing things (which might put a future admin through a little pain to figure out how it works).
I don't think this is a better solution than writing your own plugin, unless the data is coming from nodes actively pushing it out.
For example, in an IoT context, the nodes you are monitoring may actually be sending passive check results directly to the Nagios instance. In that setting, passive checks make sense, because you just want to take whatever someone else gives you and act only if no results come in (freshness checking).
In your case, it sounds like writing your own script would take care of both the timing issue and whatever additional logic you want; as far as Nagios is concerned, it only has to run the script on a schedule, watch the exit codes, and act as configured if a check fails.
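For completeness, submitting a result passively just means writing a line to the Nagios external command file. A minimal sketch, assuming the default command file path of a source install and made-up host/service names:

```python
#!/usr/bin/env python3
import time

# Assumed default location of the Nagios external command file (a named pipe)
CMD_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

def submit_passive_result(host, service, return_code, output):
    # [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<plugin_output>
    line = (f"[{int(time.time())}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")
    with open(CMD_FILE, "w") as cmd:
        cmd.write(line)

if __name__ == "__main__":
    # Hypothetical host/service names and performance data
    submit_passive_result("tx-site-1", "RF Forward", 0, "OK - 9.8 kW | 'rf_forward'=9800")
```

This is the extra external process the asker mentions: something has to keep running and feeding these lines in, with freshness checks configured in Nagios to catch it if it stops.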

Related

Logging 3000 events per second from a C program

What are my best options for logging 3k events per second from a C program? The following options come to mind, but I'm not able to decide which would be the most robust solution, with the fewest failure points, the highest reliability, and the lowest latency:
- Use a messaging server to relay events as they happen
- Use syslog for logging
- Use a Unix pipe
- Use a logging agent such as Fluentd, which will send events to the analysis server
- Write a log file locally and periodically rotate it to the analysis server using something like rsync
Try syslog. No reason to make it too complicated. With syslog-ng you can do local logging through UDP, then set up the local syslogd to forward everything over TCP to a central syslog server. You might need to run without fsync on the central syslog server to keep up with that load (but test first); that can be mitigated by forwarding everything to two separate machines. This gives you asynchronous performance locally and enough reliability that you should almost never lose events.
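The question is about a C program, where syslog(3) (openlog/syslog) would be the natural interface; as a language-neutral sketch of the same pattern, here is roughly what logging to the local syslog daemon over UDP looks like, with the facility and address chosen only for illustration:

```python
import logging
import logging.handlers

# Log to the local syslog daemon over UDP; syslog-ng/rsyslog on the same
# host then forwards over TCP to the central server. Facility and address
# are assumptions ('/dev/log' would be the Unix-socket alternative).
logger = logging.getLogger("events")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.SysLogHandler(
    address=("localhost", 514),
    facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
))

logger.info("event_id=12345 type=click payload=...")
```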
Another option I've used is to log events into Redis, Riak, or some other NoSQL data store (I usually don't recommend them for anything complex, but event logging is right up their alley). Set up mirroring for redundancy and they should be able to keep up with far more than 3k events per second.
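A minimal sketch of that idea, assuming the redis-py client and a made-up key name: each event is pushed onto a Redis list, and a separate consumer drains the list into the analysis server.

```python
import json
import time

import redis  # assumes the redis-py client is installed

# Push each event onto a Redis list; a separate consumer drains the list
# into the analysis server. The key name and event shape are made up.
r = redis.Redis(host="localhost", port=6379, db=0)

def log_event(event):
    event["ts"] = time.time()
    r.rpush("event_log", json.dumps(event))

log_event({"type": "click", "source": "app1"})
```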

Is RabbitMQ, ZeroMQ, Service Broker or something similar an appropriate solution for creating a high availability database webservice?

I have a CRUD webservice, and have been tasked with trying to figure out a way to ensure that we don't lose data when the database goes down. Everyone is aware that if the database goes down we won't be able to get "reads", but for a specific subset of the operations we want to make sure that we don't lose data.
I've been given the impression that this is something that is covered by services like 0MQ, RabbitMQ, or one of the Microsoft MQ services. Although after a few days of reading and research, I'm not even certain that the messages we're talking about in MQ services include database operations. I am however 100% certain that I can queue up as many hello worlds as I could ever hope for.
If I can use a message queue to add a layer of protection for the database, I'd lean towards Rabbit (because it appears to persist through crashes), but since the target is a Microsoft SQL Server database, perhaps one of their solutions (such as SQL Service Broker or MSMQ) is more appropriate.
The real fundamental question that I'm not yet sure of though is whether I'm even playing with the right deck of cards (so to speak).
Given the desire for a high-availability webservice that continues to function if the database goes down, does it make sense to put a RabbitMQ instance "between" the webservice and the database? Or maybe the right link in the chain is to have RabbitMQ send messages to the webserver?
Or is there some other solution for achieving this? There are a number of loose ideas at the moment around finding a way to roll up web logs in the event of a database outage, but we're still at an early enough stage that I, at least, have no idea what I'm going to do.
Is a message queue the right solution?
Introducing message queuing between a service and its database operations is certainly one way of improving service availability. Writing to a local temporary queue in a store-and-forward scenario will always be more available than writing to a remote database server, simply because it is a local operation.
Additionally, by using queuing you gain greater control over the volume and nature of the traffic your database has to handle at peak. Database writes can be queued, routed, and even committed in a different order.
However, to do this you need to be aware that a database write is now processed off-line. Even when this happens almost instantaneously, you lose a benefit that the synchronous nature of your current service gives you: your service consumers can no longer always know whether the database write operation succeeded.
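A minimal sketch of the store-and-forward idea, assuming RabbitMQ with the pika client: the webservice publishes the write as a persistent message to a durable queue instead of hitting SQL Server directly, and a separate consumer applies it to the database when the database is available. The queue name and message shape are illustrative only.

```python
import json

import pika  # assumes the pika client and a local RabbitMQ broker

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
# Durable queue + persistent messages, so queued writes survive a broker restart
channel.queue_declare(queue="db_writes", durable=True)

def enqueue_write(operation):
    channel.basic_publish(
        exchange="",
        routing_key="db_writes",
        body=json.dumps(operation),
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )

enqueue_write({"op": "insert", "table": "orders", "values": {"id": 42}})
connection.close()
```

The trade-off described above is visible here: enqueue_write returns as soon as the broker accepts the message, so the caller no longer gets a synchronous confirmation that the row actually reached the database.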
I have written about this subject before here. The user posting the question had similar concerns to you. Whether you do this or not is a decision you have to make based on whether this is something your consumers care about or not.
As for the technology stacks you are considering, this off-line model can be implemented with pretty much any of them, with the possible exception of Service Broker, which doesn't integrate well with application code (see my answer here: https://stackoverflow.com/a/45690344/569662).
If you're using Windows and unlikely to need to migrate, I would go for MSMQ (which supports durable messaging via transactional queues) as it's lightweight and part of Windows.

Simple Solr deployment with two servers for redundancy

I'm deploying the Apache Solr web app on two redundant Tomcat 6 servers to provide redundancy and improved availability. At this point, scalability is not an issue.
I have a load balancer that can dynamically route traffic to one server or the other or both.
I know that Solr supports master/slave configuration, but that requires manual recovery if the slave receives updates during the master outage (which it will in my use case).
I'm considering a simpler approach using the ability to reload a core:
- only one of the two servers is receiving traffic at any time (the "active" instance), but both are running,
- both instances share the same index data and
- before re-routing traffic due to an outage, the now active instance is told to reload the index core(s)
Limited testing of failovers with both index reads and writes has been successful. What implications/issues am I missing?
Your thoughts and opinions welcomed.
The simple approach to redundancy you're considering seems reasonable, but you will not be able to use it for disaster recovery unless you can share the data/index to/from a different physical location using your NAS/SAN.
Here are some suggestions:
Make backups for disaster recovery, and test that those backups work; an index could conceivably become corrupted, since there is no internal checksumming in SOLR/Lucene. An index could get wiped, or some records could get deleted and merged away without you knowing it, and backups can be useful for recovering those records/docs later if you need to perform an investigation.
Before you re-route traffic to the second instance, I would run some queries to load the caches and to test and confirm that the current index works before it goes online (a sketch of this cutover step follows these suggestions).
Isolate updates to one location, process, and thread to ensure transactional integrity in the event of a cutover; consistency can be difficult to manage, since SOLR does not use a vector clock to synchronize updates the way some databases do. I personally would keep an ordered copy of all updates separately from SOLR, in some other store, just in case a small time window needs to be replayed.
In general, my experience with SOLR has been excellent as long as you are not using cutting-edge features and plugins. I have one instance that currently holds 40 million docs and has had an uptime of well over a year with no issues. That doesn't mean you won't have issues, but it gives you an idea of how stable it can be.
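As a concrete illustration of that cutover step (the core reload from the question's approach plus the warm-up queries suggested above), here is a minimal sketch against the Solr CoreAdmin API; the host, port, core name, and warm-up query are assumptions:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the standby Tomcat/Solr instance
SOLR = "http://solr-b.example.com:8080/solr"

def reload_core(core):
    # CoreAdmin RELOAD, so the standby picks up the shared index files
    params = urlencode({"action": "RELOAD", "core": core})
    urlopen(f"{SOLR}/admin/cores?{params}", timeout=60).read()

def warm_up(core):
    # Cheap query to load caches and confirm the index is actually usable
    params = urlencode({"q": "*:*", "rows": 0})
    return urlopen(f"{SOLR}/{core}/select?{params}", timeout=60).read()

if __name__ == "__main__":
    reload_core("core0")
    warm_up("core0")
    # Only now tell the load balancer to send traffic to this instance
```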
I hardly know anything about Solr, so I don't know the answers to some of the questions that need to be considered with this sort of setup, but I can provide some things for consideration. You will have to consider what sorts of failures you want to protect against and why and make your decision based on that. There is, after all, no perfect system.
Both instances are using the same files. If the files become corrupt or unavailable for some reason (hardware fault, software bug), the second instance is going to fail the same as the first.
On a similar note, are the files stored and accessed in such a way that they are always valid when the inactive instance reads them? Will the inactive instance try to read the files when the active instance is writing them? What would happen if it does? If the active instance is interrupted while writing the index files (power failure, network outage, disk full), what will happen when the inactive instance tries to load them? The same questions apply in reverse if the 'inactive' instance is going to be writing to the files (which isn't particularly unlikely if it wasn't designed with this use in mind; it might for example update some sort of idle statistic).
Also, reloading the indices sounds like it could be a rather time-consuming operation, and service will not be available while it is happening.
If the active instance needs to complete an orderly shutdown before the inactive instance loads the indices (perhaps due to file validity problems mentioned above), this could also be time-consuming and cause unavailability. If the active instance can't complete an orderly shutdown, you're gonna have a bad time.

Handling database queries that fail due to a server failover

In an environment with a SQL Server failover cluster or mirror, how do you prefer to handle errors? It seems like there are two options:
1. Fail the entire current client request, and let the user retry
2. Catch the error in your DAL, and retry there
Each approach has its pros and cons. Most shops I've worked with do #1, but many of them also don't follow strict transactional boundaries, and seem to me to be leaving themselves open for trouble in the event of failure. Even so, I'm having trouble talking them into #2, which should also result in a better user experience (one catch is the potentially long delay while the failover happens).
Any arguments one way or the other would be appreciated. If you use the second approach, do you have a standard wrapper that helps simplify implementation? Either way, how do you structure your code to avoid issues such as those related to the lack of idempotency in the command that failed?
Number 2 could turn into an infinite loop. What if it's network-related, or the local PC needs to be rebooted, or whatever?
Number 1 is annoying to users, of course.
If you only allow access via a web site, then you'll never see the error anyway unless the failover happens mid-call. For us, this is unlikely and we have failed over without end users realising.
In real life you may not have a nice clean DAL on a web server. You may have an Excel sheet connecting directly (most financials), or WinForms where the connection is kept open, so you only have the one option.
Failover should only take a few seconds anyway. If the DB recovery takes longer than that, you have bigger issues. And if it happens often enough that you have to think about handling it, well...
In summary, it will happen so rarely that you'll want to know about it, and number 1 would be better, IMHO.
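If you do go with option 2, a sketch of a bounded retry wrapper (so a network problem can't become the infinite loop mentioned above) might look like this; the command callable and the transient-error test are placeholders, and retrying is only safe for idempotent commands:

```python
import time

MAX_ATTEMPTS = 3         # bounded, so this can never loop forever
RETRY_DELAY_SECONDS = 5  # rough time to allow the failover to complete

def execute_with_retry(run_command, is_transient_error):
    """run_command: callable performing the (idempotent) database command.
    is_transient_error: callable deciding whether an exception looks like a
    failover/connection blip worth retrying."""
    last_exc = None
    for _ in range(MAX_ATTEMPTS):
        try:
            return run_command()
        except Exception as exc:
            if not is_transient_error(exc):
                raise                        # a real error: surface it at once
            last_exc = exc
            time.sleep(RETRY_DELAY_SECONDS)  # give the failover time to finish
    raise last_exc                           # give up: fall back to option 1
```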

Should you test an external system prior to using it?

Note: This is not for unit testing or integration testing. This is for when the application is running.
I am working on a system which communicates with multiple back-end systems, which can be grouped into three types:
- Relational database
- SOAP or WCF service
- File system (network share)
Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(
The thinking is to have a small bit of test code which runs before the actual code. If there is a problem, persist the request and poll the target system until it is available. Tests could possibly be rerun within the code at logical points to check the system is still available. The ultimate goal is a very stable system, regardless of the stability (or lack thereof) of the systems it communicates with.
My questions around this design are:
- Are there major issues with it? (Small things, like the fact it may fail between the test completing and the code running, are understandable.)
- Are there better ways to implement this sort of design?
- Would using traditional exception handling and/or transactions be better?
Updates
- The system needs to talk to the back-end systems in a coordinated way.
- The system is very async in nature, so using things like queuing technologies is fine.
- The system must run even if one or more back-end systems are down, as others may be up and processing of some information is possible.
You will be needing that traditional exception handling no matter what, since as you point out there's always the chance that things'll fail between your last check and the actual request. So I really think any solution you find should try to interact smoothly with this.
You don't say whether these flaky resources need to interact in some kind of coordinated manner; if they do, you should probably be using a transaction manager of some sort. I do not believe you want to get into the footwork of transaction management in application code for most needs.
Sometimes I have also seen people use AOP to encapsulate retry logic to back-end systems that fail (for instance due to time-out issues). Used sparingly this may be a decent solution.
In some cases you can also use message queuing technology to alleviate unstable back-ends. You could for instance commit to a message queue as part of a transaction, and only pop off the queue when successful. But this design is normally only possible when you're able to live with an asynchronous process.
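A minimal sketch of that queue-backed pattern, assuming RabbitMQ with the pika client: a request is consumed, the flaky back end is called, and the message is acknowledged only after the call succeeds, so a failure leaves the request queued for a later retry. The queue name and call_backend are placeholders; in practice you would also add a delay or dead-letter queue so failures don't spin.

```python
import pika  # assumes the pika client and a RabbitMQ broker

def call_backend(payload):
    raise NotImplementedError("replace with the real back-end call")

def handle(channel, method, properties, body):
    try:
        call_backend(body)
        channel.basic_ack(delivery_tag=method.delivery_tag)   # success: remove from queue
    except Exception:
        # failure: put the request back on the queue and try again later
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="backend_requests", durable=True)
channel.basic_consume(queue="backend_requests", on_message_callback=handle)
channel.start_consuming()
```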
And as always, real stability can only be achieved by attacking the root cause of the problem. I once had a 25-year-old bug in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.
The Microsoft Smartclient framework provides a ConnectionMonitor class. Should be easy to use or duplicate.
Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was a thick client, so we could run the test every time the app started. The sanity test would go out and check things like database availability and external network (extranet) access, and it could have been extended to cover web services as well.
If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unwieldy because so many were being created, but we then set up filters so we knew when something really bad was happening. Overall the approach worked pretty well; our biggest win was being able to tell users that the system was down before they had entered data and got part way through a long-winded process. They absolutely loved it.
At a technical level, the sanity tester was written in C#; it used exception handling in a conventional way to find the problems it was looking for. It became a mini app in its own right, standalone from the main app. If I were doing it again I'd use a logging framework to capture issues, which is more flexible than our hard-coded approach.
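A rough sketch of what such a start-up sanity tester might look like, with one cheap probe per back-end type from the question. The hosts, URL, and share path are placeholders, and the original was C#, so treat this as language-agnostic pseudocode:

```python
import os
import socket
from urllib.request import Request, urlopen

def check_database(host, port=1433):
    # Cheap TCP probe of the SQL Server port
    try:
        with socket.create_connection((host, port), timeout=5):
            return True
    except OSError:
        return False

def check_web_service(url):
    # HEAD request against the SOAP/WCF endpoint
    try:
        urlopen(Request(url, method="HEAD"), timeout=5)
        return True
    except OSError:
        return False

def check_file_share(path):
    # Network share mounted/reachable?
    return os.path.isdir(path)

if __name__ == "__main__":
    checks = {
        "database": check_database("db.example.local"),
        "web service": check_web_service("http://svc.example.local/health"),
        "network share": check_file_share(r"\\fileserver\drop"),
    }
    for name, ok in checks.items():
        print(f"{name}: {'OK' if ok else 'UNAVAILABLE'}")
```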
