How to decide warning and critical levels in Nagios?

Can someone please tell me on what basis I should decide the warning and critical levels for a check in Nagios? I want to get a notification for the WARNING state before the check goes CRITICAL. The check I am performing is check_es_logs, which checks Elasticsearch logs. Performance data is below:

Warning and critical thresholds are plugin-specific, which means that they are set in your check command. The plugin decides what state it should return (technically, which exit code) based on the thresholds that you pass to it with flags, normally -w and -c.
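For example, with the standard check_disk plugin (a minimal sketch; substitute the flags that check_es_logs actually accepts and thresholds that fit your data), you can verify the behaviour by running the plugin by hand and then wire the same command line into a command definition:

    # Exits 0 (OK) above 20% free, 1 (WARNING) below 20%, 2 (CRITICAL) below 10%
    $ /usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /var
    $ echo $?

    # The same thresholds in a Nagios command definition:
    define command {
        command_name    check_var_disk
        command_line    $USER1$/check_disk -w 20% -c 10% -p /var
    }

As for picking the numbers themselves: a common approach is to set -w at the level where someone should start paying attention and -c at the level where something is about to break, then tune both against observed behaviour.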
If you want a more specific answer you need to ask a more specific question. For example:
I want to have notification for warning state before the check sends out critical
This doesn't make a lot of sense to me since you would just end up with two notifications, one saying Warning and one saying Critical.
You can do additional filtering, for example by saying that a contact should only receive notifications for a certain state, but since I don't know what exact end result you want, I'm not sure what to suggest.
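As an illustration of that filtering (a hypothetical sketch; required directives such as alias, notification periods and notification commands are omitted), Nagios contact definitions let you pick which states a contact is paged for via service_notification_options:

    # Gets WARNING and recovery notifications only:
    define contact {
        contact_name                    early-warning-team
        service_notification_options    w,r
        ...
    }

    # Gets CRITICAL, UNKNOWN and recovery notifications only:
    define contact {
        contact_name                    oncall-team
        service_notification_options    c,u,r
        ...
    }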
Also, Icinga is a very different product from Nagios Core, so if this is an Icinga-specific question it should not be tagged nagios, as the syntax for config files etc. will likely differ a lot.

Related

Salesforce Apex CPU Limit

We are currently having an issue with the Apex CPU limit. We have a lot of processes that are most likely not optimized; I have already combined some processes for the same object, but it is not enough. I am trying to understand the logs right now. As you can see in the screenshots, there is one process that is being called multiple times (I assume once for each created record). Even if I create, for example, 60 records in one operation/DML statement, the Process Builders still get called 60 times? (This is what I think is happening.) Is that the problem we are having right now? If so, is there a better way to do it? Right now we need the updates from the Process Builder to run, but I expected them to get bulkified or something like that. I was also thinking there might be some looping between processes. If there is more information you need, please let me know. Thank you.
Well, yes, the Process Builder will be invoked 60 times, one record at a time. But that shouldn't be your problem. The final update / create child records / email send (or whatever your action is) will be bulkified; it won't save one record at a time. If the process calls some Apex actions, they're supposed to support being passed a collection of records, not just a single record.
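For reference, an invocable Apex action receives all records from the transaction in one call, so it should be written against the whole list. A hypothetical sketch (the class name, label and Score__c field are made up):

    public class ScoreRecalculator {
        // Called once per transaction with ALL affected record Ids,
        // so one query and one DML statement cover the whole batch.
        @InvocableMethod(label='Recalculate Scores')
        public static void recalculate(List<Id> accountIds) {
            List<Account> accounts =
                [SELECT Id, Score__c FROM Account WHERE Id IN :accountIds];
            for (Account a : accounts) {
                a.Score__c = 0; // placeholder for the real recalculation
            }
            update accounts; // single DML for all records
        }
    }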
You may be looking in the wrong place. CPU time suggests code problems, not config (Flow, Workflow, Process Builder... although if you're updating fields on "this" record, it's possible you'd benefit from before-save flows). Try to compare the timestamps of METHOD_BEGIN and METHOD_END entries for triggers and code methods (including invocable actions / process plugin interfaces).
Maybe there's code that doesn't need to run because key fields didn't change and there's nothing to recalculate or roll up. Hard to say without seeing the debug log.
Maybe the operation doesn't have to be immediate. Think about whether you can offload some work to "scheduled actions", "time-based workflows" or, in Apex terms, @future, Batchable, Queueable. But it would have to be relatively safe to run in the background: if there's an error, it won't be displayed to the user, so you'd need to handle the errors manually (send an email, create a record, make a Chatter post or a bell notification).
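A minimal Queueable sketch of that idea (the class and variable names are assumptions, and the catch block stands in for real error handling):

    public class ScoreRecalcJob implements Queueable {
        private List<Id> accountIds;
        public ScoreRecalcJob(List<Id> accountIds) {
            this.accountIds = accountIds;
        }
        public void execute(QueueableContext ctx) {
            try {
                // heavy recalculation here, outside the user's transaction
            } catch (Exception e) {
                // background failures are invisible to the user:
                // log them, email someone, or create an error record
                System.debug(LoggingLevel.ERROR, e.getMessage());
            }
        }
    }
    // enqueue from a trigger or action:
    // System.enqueueJob(new ScoreRecalcJob(ids));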
You could try uploading the log to https://apextimeline.herokuapp.com/ and trying to make sense of the Gantt-chart-like output. Or capture the log the "pro" way, with https://help.salesforce.com/s/articleView?id=sf.code_dev_console_solving_problems_using_system_log.htm&type=5 or https://marketplace.visualstudio.com/items?itemName=financialforce.lana (you'll likely need a developer's help to make sense of it).

Handling poison messages in Apache Flink

I am trying to figure out the best practices for dealing with poison messages / unhandled exceptions in Apache Flink. We have a job doing real-time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires wrapping every piece of logic with something that catches every runtime exception
Use side outputs as a kind of DLQ - from what I can tell, this seems to be a variation on #1, where I still have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
    source => validate => ... => sink
                  \=> dead letter queue
As soon as a record passes your validate operator, you want all errors to bubble up, as any error in those later operators may result in corrupted aggregates and data that, once written, cannot easily be reverted.
The validate step would work with either of the two approaches that you outlined. Typically, side outputs have better semantics, but you may end up with more code.
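A minimal sketch of such a validate operator using a side output (the Event type and its isValid() hook are assumptions):

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;
    import org.apache.flink.util.OutputTag;

    public class ValidateFn extends ProcessFunction<Event, Event> {
        // anonymous subclass so the generic type survives erasure
        public static final OutputTag<Event> DEAD_LETTERS =
                new OutputTag<Event>("dead-letters") {};

        @Override
        public void processElement(Event event, Context ctx, Collector<Event> out) {
            if (event.isValid()) {
                out.collect(event);              // good records continue downstream
            } else {
                ctx.output(DEAD_LETTERS, event); // bad records go to the DLQ
            }
        }
    }

Downstream, you would pick the rejected records up with stream.getSideOutput(ValidateFn.DEAD_LETTERS) and route them to a dead-letter sink.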
Now, you may have a service with high SLAs that you actually want to produce output even if it is corrupted, just to produce data. Or you may have a simple transformation pipeline where you'd rather miss some events and keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap the operator code with try-catch. However, you'd typically still only do that for the fragile operators, not for all of them. Trivial operators should be tested and then trusted to work. Further, you'd usually only catch specific kinds of exceptions, to limit the scope to the kinds of expected errors that can happen.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let developers explicitly reason about exceptions and exception handling. It's also not straightforward to see what the requirements are: do you want to store only the input? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high-quality data is ingested and properly processed.
So while it looks easy for your case, because you know exactly which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to a generic solution.
What you could do is extract most of the complicated logic into a single ProcessFunction and use side outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's needed multiple times, you could extract a helper function that takes your actual code as a RunnableWithException lambda and hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
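A sketch of such a helper (RunnableWithException here is Flink's org.apache.flink.util.function.RunnableWithException; the other names are assumptions):

    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.OutputTag;
    import org.apache.flink.util.function.RunnableWithException;

    public final class Guard {
        // Runs the fragile logic; on failure, emits the offending input
        // to the dead-letter side output instead of failing the job.
        public static <T> void guarded(RunnableWithException logic, T input,
                                       OutputTag<T> deadLetters,
                                       ProcessFunction<T, ?>.Context ctx) {
            try {
                logic.run();
            } catch (Exception e) {
                // only the input is captured here; add the exception if needed
                ctx.output(deadLetters, input);
            }
        }
    }

Inside a processElement you would then write Guard.guarded(() -> transform(event), event, DEAD_LETTERS, ctx); (where transform is your fragile logic), so only the wrapper knows about the side output.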
I'd also add quite a few integration tests (IT cases) and use mutation testing to harden your pipeline more quickly. If you keep your test data inline, the mutants may also simulate exactly your unexpected data issues, so that your validate operator becomes more complete.

Requesting irq for a multi channel device

Assume a PCI driver for the Linux kernel. The device can have multiple channels that can be brought "up" or "down" individually.
Each "up" calls the function .ndo_open and each "down" calls .ndo_stop.
The device needs only one interrupt line, which can be requested with request_irq(). Each request will create one interrupt line.
It is important to note here that interrupt lines are scarce and should not be requested mindlessly.
My question for this situation is: where should I call request_irq()?
In my opinion, I have two possible solutions for this.
Right in probe(). This will create only one interrupt line, but it will always be created when the machine boots, so it might go unused.
In .ndo_open. This will create the interrupt line only when it is needed, but a multichannel device can trigger multiple calls of .ndo_open, which will result in multiple calls of request_irq() (a guard against that is sketched below).
I was not able to find any information about this situation in the kernel docs. If there is a guideline for this, can you please explain it or point me to it? I also checked other PCI drivers from the git repo, but none (or at least none of the ones I checked) had this problem.
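To make option 2 concrete, a refcount guard so that only the first .ndo_open actually requests the line might look roughly like this (a hypothetical sketch, not taken from any in-tree driver; mydev_isr is an assumed handler and locking around the counter is omitted):

    #include <linux/interrupt.h>
    #include <linux/netdevice.h>

    static irqreturn_t mydev_isr(int irq, void *dev_id); /* assumed handler */

    struct mydev_priv {
        int irq;            /* from pci_dev->irq */
        int open_channels;  /* channels currently up */
    };

    static int mydev_ndo_open(struct net_device *ndev)
    {
        struct mydev_priv *priv = netdev_priv(ndev);
        int ret = 0;

        /* Only the first channel going up requests the shared line. */
        if (priv->open_channels++ == 0) {
            ret = request_irq(priv->irq, mydev_isr, IRQF_SHARED,
                              "mydev", priv);
            if (ret)
                priv->open_channels--;
        }
        return ret;
    }

The matching .ndo_stop would then call free_irq() only when the counter drops back to zero.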

How to catch "NRPE unable to read output" when it occurs?

I'm trying to catch the "NRPE unable to read output" message from a plugin and send an email when it occurs, and I'm a little bit stuck :). The thing is, there are different return codes when this error occurs on different plugins:
    Return code    Service status
    0              OK
    1              WARNING
    2              CRITICAL
    3              UNKNOWN
Is there a way either to unify the return codes of all the plugins I use (so that it is always 2 [CRITICAL] when this problem occurs), or any other way to catch these alerts? I want to keep the return codes for the different situations as they are (i.e. filesystem /home will be WARNING (return code 1) at 95% and CRITICAL (return code 2) at 98%).
Most folks would rather not have this error sending alert emails, because it does not represent an actual failed check. Basically it means nothing more than:
The command/plugin (local or remote) was run by NRPE, but
failed to return any usable status and/or text back to NRPE.
This most often means something went wrong with the command/plugin and it hasn't done the job it was expected to perform. You don't want alerts being thrown for checks when the check wasn't actually performed, as this would be very misleading. It's also important to note that the return code is not even coming from the command/plugin.
In my experience, the number one cause of this error is a bad check. And as the docs for NRPE state, you should run the check (with all its options!) by hand to make sure it runs correctly. Do yourself a favor and test both working AND non-working states. About 75% of the time, this happens because the check only works correctly when it has OK results, and blows up when something not-OK must be reported.
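For example, from the Nagios server (the install path, host address and check name here are assumptions):

    # Run the remote check exactly as Nagios would, then inspect the exit code:
    $ /usr/local/nagios/libexec/check_nrpe -H 192.168.1.10 -c check_home_fs
    $ echo $?

Force the not-OK case too (e.g. lower a threshold temporarily) and confirm you still get sensible output with a 1 or 2 exit code rather than this error.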
Another cause of these is network glitches: NRPE connects and runs the check, but the connection is closed before any response is seen. Once again, not a true check result.
For a production Nagios monitoring system, these should be very rare errors. If they are happening frequently, then you likely have other issues that need to be fixed.
And as far as I can tell, all built-in Nagios plugins use the exact same set of return codes. Are you certain this isn't a 'custom' check?
OK, I think I've found the solution to my problem: I will try to check nagios.log on each node for those errors.
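A rough sketch of that idea (the log path and the message text are assumptions, and a real job would track how far into the log it has already read):

    # Hypothetical cron job: mail when the error appears in the Nagios log
    grep -qi "unable to read output" /usr/local/nagios/var/nagios.log && \
        mail -s "NRPE output errors on $(hostname)" ops@example.com < /dev/null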

Create many passes from the app - iPhone Passbook

SITUATION:
I have an application where I have to issue a gift-coupon kind of thing when the user reaches a certain score, say 'x'.
I want to create a coupon with a unique QR code at the time the user reaches the score 'x', so that he can download it on his iPhone and use it. Once it is used, the coupon should be invalidated. This applies to any user of the application, meaning a coupon is created once the score is reached and deleted or invalidated once it is used.
ISSUE:
I'm not able to figure out how to create a coupon every time any user reaches the score. Of course, I did go through a lot of documentation and links like http://www.raywenderlich.com/20734/beginning-passbook-part-1. I also tried using pass-source, but a valid account requires you to pay a minimum of about $8.
As suggested in the raywenderlich tutorials, I can create passes, but they are not created through the application.
Also, I didn't see any method by which we can be notified when a user uses his issued coupon, so that we can invalidate it.
Am i missing something here?
"Using" a QR code on a coupon means it is scanned by something else. That something else has to take responsibility to report the activity back to you, so you could then update the pass with an "Expired" flag in your database, re-sign and rebuild the pass, issue the push notification so that it would eventually update on the device. You'd also probably want that scanner-thingie to check with you to see that the code is valid before accepting it. So, yeah, not Apple's problem.
