Nagios - Critical hard state if value increased

How can I set up a service check so that its state goes to a hard CRITICAL whenever a performance data value increases?
Thanks!

Related

Flink checkpointing working for ProcessFunction but not for AsyncFunction

I have operator checkpointing enabled and working smoothly for a ProcessFunction operator.
On job failure I can see how operator state gets externalized on the snapshotState() hook, and on resume, I can see how state is restored at the initializeState() hook.
However, when I try to implement the CheckpointedFunction interface and the two aforementioned methods on an AsyncFunction, it does not seem to work. I'm doing virtually the same as with the ProcessFunction, but when the job is shutting down after a failure, it does not seem to stop at the snapshotState() hook, and upon job resume, context.isRestored() is always false.
Why are CheckpointedFunction.snapshotState() and CheckpointedFunction.initializeState() executed with ProcessFunction but not with AsyncFunction?
Edited:
For some reason, my checkpoints are taking very long. My config is very standard, I believe: an interval of 1 second, a 500 ms min pause, exactly-once mode. No other tuning.
I'm getting these traces from the checkpointing coordinator:
o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms
2021-11-23 16:25:01 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms).
If I attempt to set a checkpointTimeout, I need to set something on the order of 5 minutes or so. How come a checkpoint of such a small state (it's just a Counter and a Long) takes 5 minutes?
I've also read that NFS volumes are a recipe for trouble, but so far I haven't run this on the cluster; I'm just testing it on my local filesystem.
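For reference, a minimal sketch of the checkpointing setup described above (the interval and min pause are the values from the question; the timeout line only reflects the ~5 minute value mentioned):
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE); // 1 second interval, exactly-once
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);  // 500 ms minimum pause
env.getCheckpointConfig().setCheckpointTimeout(5 * 60 * 1000); // only needed because checkpoints are unexpectedly slow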
AsyncFunction doesn't support state at all. The reason is that state primitives are not synchronized and thus would produce incorrect results in AsyncFunction. That's the same reason why there is no KeyedAsyncFunction.
If Flink had FLIP-22 (https://cwiki.apache.org/confluence/display/FLINK/FLIP-22%3A+Eager+State+Declaration) implemented, it could simply attach the state to each async call and update it when the async call completes successfully.
You can do some trickery with chained maps and slot sharing groups to work around the limitation, but it's rather hacky.
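A rough sketch of that idea, with the state kept in a stateful function chained in front of the async stage rather than inside the AsyncFunction itself (MyAsyncEnricher and the wiring are illustrative placeholders, and this is operator state, not keyed state):
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class CountingMap extends RichMapFunction<String, String> implements CheckpointedFunction {
    private transient ListState<Long> checkpointedCount; // operator state, snapshotted and restored below
    private long count;

    @Override
    public String map(String value) {
        count++; // the state is updated here, not inside the AsyncFunction
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedCount = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("count", Long.class));
        if (ctx.isRestored()) {
            for (Long c : checkpointedCount.get()) {
                count += c;
            }
        }
    }
}

// wiring: the stateful map runs before the async enrichment
// DataStream<String> enriched = AsyncDataStream.unorderedWait(
//         input.map(new CountingMap()), new MyAsyncEnricher(), 5, TimeUnit.SECONDS);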

What happens to keyed window-global state without TTL if a key is never seen again?

Flink's ProcessWindowFunction can use so-called global state with something like context.globalState().getState.
Usually, such state can grow and shrink as time moves forward, but what happens if global state without TTL is created for a key and that key is never seen again? According to the documentation, TTL cannot be added during an upgrade, so will the state stay there forever?
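For context, a minimal sketch of the pattern described in the question (Event and the key/window types are placeholders): an entry created per key via context.globalState() with no TTL configured stays in the state backend until something explicitly clears it, even if the key never appears again.
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Event stands in for the actual element type
public class CountPerKey extends ProcessWindowFunction<Event, Long, String, TimeWindow> {
    private final ValueStateDescriptor<Long> totalDesc =
            new ValueStateDescriptor<>("window-global-total", Long.class); // no TTL configured

    @Override
    public void process(String key, Context ctx, Iterable<Event> elements, Collector<Long> out) throws Exception {
        ValueState<Long> total = ctx.globalState().getState(totalDesc);
        long current = total.value() == null ? 0L : total.value();
        for (Event ignored : elements) {
            current++;
        }
        total.update(current); // this per-key entry is only removed by an explicit clear() or a TTL
        out.collect(current);
    }
}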

About StateTtlConfig

I'm configuring my StateTtlConfig for a MapState. What I want is for the objects in the state to have, for example, 3 hours of life and then disappear from the state and be garbage collected, releasing some memory; the checkpoints should shrink too, I think. I had this configuration before, and it seems it was not working, because the checkpoints were always growing:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3)).cleanupFullSnapshot().build();
Then I realized that this configuration works only when reading state from a savepoint, which is not my scenario. I changed my TTL configuration to this one:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3))
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired).build();
The idea is that I want to clean up the state for all keys after a defined time.
My questions are:
Am I using the right configuration now?
What is the best way to do it?
Thanks one more time.
Kind regards!!!
I don't know enough about your use case to recommend a specific expiration/cleanup policy, but I can offer a few notes.
My understanding is that cleanupFullSnapshot() specifies that, in addition to whatever other cleanup is being done, a full cleanup will be done whenever a full snapshot is taken.
The FsStateBackend uses the incremental cleanup strategy. By default it checks 5 entries during each state access, and does no additional cleanup during record processing. If your workload is such that there are many more writes than reads, that might not be enough. If no access happens to the state, expired state will persist. Choosing cleanupIncrementally(10, false) will make the cleanup more aggressive, assuming you do have some level of state access going on.
It's not unusual for checkpoint sizes to grow, or for them to take longer than you'd expect to reach a plateau. Could it simply be that the key space is growing?
https://flink.apache.org/2019/05/19/state-ttl.html is a good resource for learning more about Flink's State TTL mechanism.
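Putting those notes together, a sketch of a more aggressive configuration (the 3-hour TTL comes from the question; the incremental-cleanup numbers are just an example and should be tuned):
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.hours(3))
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupIncrementally(10, false) // check 10 entries per state access, no extra cleanup per record
        .cleanupFullSnapshot()           // also drop expired entries when a full snapshot is taken
        .build();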

Can we update a state's TTL value?

We have a topology that uses state (ValueState and ListState) with TTL (StateTtlConfig) because we cannot use timers (we would generate hundreds of millions of timers per day, and that does not scale: a savepoint/checkpoint would take hours to be generated and might even get stuck while running).
However, we need to update the TTL value at runtime depending on the type of some incoming events and other logic. Is it alright to create a new state with a new StateTtlConfig (and an updated TTL) and copy the values from the "old" state to the "new" one in the processElement1() and processElement2() methods of a CoProcessFunction (instead of doing it once in open(), as we usually do)?
I guess the "old" state would be garbage collected (?).
Would this solution scale? Would it be performant? Would it cause any issues or anything bad?
I think your approach of re-creating the state at runtime can work to some extent, but it is brittle. The problem I can see is that the old state's meta information can linger somewhere, depending on the backend implementation.
For Heap (FS) backend, eventually the checkpoint/savepoint will have no records for the expired old state but the meta info can linger in memory while the job is running. It will go away if the job is restarted.
For RocksDB, the column family of the old state can linger. Moreover, the background cleanup runs only during compaction. If the table is too small, like the part which is in memory, this part (maybe even a bit on disk) will linger. It will go away after restart if cleanup on full snapshot is active (not for incremental checkpoints).
All in all, it depends on how often you have to create the new state and restart your job from savepoint/checkpoint.
I created a ticket to document what can be changed in the TTL config and when, so check the issue for details.
I guess the "old" state would be garbage collected (?).
From the Flink documentation, "Cleanup of Expired State":
By default, expired values are explicitly removed on read, such as
ValueState#value, and periodically garbage collected in the background
if supported by the configured state backend. Background cleanup can
be disabled in the StateTtlConfig:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.disableCleanupInBackground()
.build();
or execute the clean up after a full snapshot:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.cleanupFullSnapshot()
.build();
You can change the TTL at any time, according to the documentation. However, you have to restart the query (it is not done at runtime):
For existing jobs, this cleanup strategy can be activated or
deactivated anytime in StateTtlConfig, e.g. after restart from
savepoint.
But why don't you use timers on RocksDB, as David said in the referenced answer?
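For comparison, a rough sketch of the timer-based alternative inside a KeyedProcessFunction (the state name and the 3-hour delay are illustrative):
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ExpiringCounter extends KeyedProcessFunction<String, String, Long> {
    private transient ValueState<Long> count;

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
        Long current = count.value();
        if (current == null) {
            current = 0L;
            // clear this key's state 3 hours after it is first written (one timer per key)
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 3 * 60 * 60 * 1000L);
        }
        count.update(current + 1);
        out.collect(current + 1);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        count.clear(); // manual "TTL": drop the state when the timer fires
    }
}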

Flink add a TTL to an existing value state

For one of our Flink jobs, we found a state that is causing a state leak. To fix this, we need to add a TTL to the state causing the leak; however, we would like to keep the existing state (savepoint). If we add a TTL to a value state, would we be able to use the existing savepoint? Thank you.
No, according to the docs this won't work:
Trying to restore state, which was previously configured without TTL, using TTL enabled descriptor or vice versa will lead to compatibility failure and StateMigrationException.
However, you may be able to use the state processor API to accomplish this. Exactly how you should handle it depends on what kind of state it is, how it was serialized, and whether the operator has a UID.
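A rough sketch of the state processor API route, assuming the DataSet-based API (org.apache.flink.state.api) available up to Flink 1.14; the operator uid, state name, types, and paths are illustrative and must match your job:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class AddTtlToSavepoint {

    // simple carrier for what we read out of the old savepoint
    public static class Entry {
        public String key;
        public Long value;
    }

    // 1) read the existing (TTL-less) state; the descriptor must match the original one
    public static class Reader extends KeyedStateReaderFunction<String, Entry> {
        private ValueState<Long> state;
        @Override
        public void open(Configuration parameters) {
            state = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("leaky-state", Long.class)); // original descriptor, no TTL
        }
        @Override
        public void readKey(String key, Context ctx, Collector<Entry> out) throws Exception {
            Entry e = new Entry();
            e.key = key;
            e.value = state.value();
            out.collect(e);
        }
    }

    // 2) write the values back through a descriptor that has TTL enabled
    public static class Writer extends KeyedStateBootstrapFunction<String, Entry> {
        private ValueState<Long> state;
        @Override
        public void open(Configuration parameters) {
            ValueStateDescriptor<Long> desc = new ValueStateDescriptor<>("leaky-state", Long.class);
            desc.enableTimeToLive(StateTtlConfig.newBuilder(Time.hours(3)).build()); // TTL added here
            state = getRuntimeContext().getState(desc);
        }
        @Override
        public void processElement(Entry e, Context ctx) throws Exception {
            state.update(e.value);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        ExistingSavepoint savepoint = Savepoint.load(env, "file:///savepoints/old", new MemoryStateBackend());

        DataSet<Entry> oldState = savepoint.readKeyedState("leaky-operator-uid", new Reader());
        BootstrapTransformation<Entry> withTtl = OperatorTransformation
                .bootstrapWith(oldState)
                .keyBy(e -> e.key)
                .transform(new Writer());

        savepoint
                .removeOperator("leaky-operator-uid")
                .withOperator("leaky-operator-uid", withTtl)
                .write("file:///savepoints/with-ttl");

        env.execute("add ttl to existing state");
    }
}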
