Managed Keyed State not restoring with Checkpointing Enabled - apache-flink

EDIT2: Skip to the end. The state is restored, but it's not queryable. New tl;dr: "How do I make state that was queryable still queryable after a restore from a checkpoint?"
I have a keyed stream with checkpointing enabled, similar to this (I've tried this with the in-memory backend as well as HDFS, with the same results):
env.enableCheckpointing(60000)
env.setStateBackend(new FsStateBackend("file:///flink-test"))
val stream = env.addSource(consumer)
.flatMap(new ValidationMap()).name("ValidationMap")
.keyBy(x => new Tuple3[String, String, String](x.account(), x.organization(), x.`type`()))
.flatMap(new Foo()).name(jobname)
Within this stream, I have a Managed Keyed State ValueState that I set as queryable.
val newValueStateDescriptor = new ValueStateDescriptor[java.util.ArrayList[java.util.ArrayList[Long]]]("foo", classOf[java.util.ArrayList[java.util.ArrayList[Long]]])
newValueStateDescriptor.setQueryable("foo")
valueState = getRuntimeContext.getState[java.util.ArrayList[java.util.ArrayList[Long]]](newValueStateDescriptor)
valueState.update(new java.util.ArrayList[java.util.ArrayList[Long]]())
This list is periodically appended to or removed from and the valueState is updated. When I make a request of the Queryable State I currently see correct values.
In my JobManager log I see checkpointing happen every minute, and when I check the file system, I see non-empty files being created.
My setup has 3 JobManagers (2 in standby), 3 TaskManagers (all 3 in use).
I put a single data point into the system and read it out of QueryableState, everything looks good. Then I pick a single TaskManager (not even the one that processed the data, any of the 3) and I kill it, then restart it to simulate a crash.
I watch the job get retried 2 or 3 times until the TaskManager comes back online, and finally I see the same JobID running again in Flink, life seems good.
But, I then hit the Queryable State again, and I get an UnknownKvStateLocation exception.
I'm really not sure what I've done wrong here. Things appear to be checkpointing, but I never manage to get my ValueState back. Maybe it's back but not queryable?
EDIT:
Log snippet from JobManager implies things are restored
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state RESTARTING to CREATED."}
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Recovering checkpoints from ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Found 1 checkpoints in ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Trying to retrieve checkpoint 5."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","ndc":"", "message":"Restoring from latest valid checkpoint: Checkpoint 5 # 1496330912627 for dc7850a6866f181c2f07968d35fe3d46."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state CREATED to RUNNING."}
It really looks like it's restored, and when I inspect the file created in /flink-test I see some binary data but it contains the identifying names for my Queryable State ValueState. Any ideas on what to look for would be welcome.
EDIT2: State >is< restored, it's just not queryable!

The fact that a given piece of registered state has been made queryable is not (currently) part of what Flink records in checkpoints or savepoints. So after recovery, the state isn't queryable until a new StateDescriptor is provided.
For more, see this discussion on the flink-users mailing list.

Related

Flink checkpointing working for ProcessFunction but not for AsyncFunction

I have operator checkpointing enabled and working smoothly for a ProcessFunction operator.
On job failure I can see how operator state gets externalized on the snapshotState() hook, and on resume, I can see how state is restored at the initializeState() hook.
However, when I try to implement the CheckpointedFunction interface and the two aforementioned methods on an AsyncFunction, it does not seem to work. I'm doing virtually the same as with the ProcessFunction, but when the job is shutting down after a failure, it does not seem to pass through the snapshotState() hook, and when the job resumes, context.isRestored() is always false.
Why are CheckpointedFunction.snapshotState() and CheckpointedFunction.initializeState() not executed with an AsyncFunction, when they are with a ProcessFunction?
Edited:
For some reason, my checkpoints are taking very long. My config is very standard, I believe: an interval of 1 second, a 500 ms minimum pause, exactly-once. No other tuning.
I'm getting these traces from the checkpoint coordinator:
o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms
2021-11-23 16:25:01 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms).
If I attempt to set a checkpointTimeout, I need to set something on the order of 5 minutes or so. How come a checkpoint of such a small state (it's just a Counter and a Long) takes 5 minutes?
I've also read that NFS volumes are a recipe for trouble, but so far I haven't run this on the cluster; I'm just testing it on my local filesystem.
AsyncFunction doesn't support state at all. The reason is that state primitives are not synchronized and thus would produce incorrect results in AsyncFunction. That's the same reason why there is no KeyedAsyncFunction.
If Flink had https://cwiki.apache.org/confluence/display/FLINK/FLIP-22%3A+Eager+State+Declaration implemented then it could simply attach the state on each async call and update on successful async.
You can do some trickery with chained maps and slot sharing groups to work around the limitation, but it's rather hacky.

Flink window aggregation with state

I would like to do a window aggregation with early-trigger logic (think of the aggregation as being triggered either when the window closes or by a specific event), and I read this in the docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The docs say: "Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient." So the suggestion is to pair it with incremental window aggregation.
My question is about AverageAggregate in the docs: the state is not saved anywhere, so if the application crashes, AverageAggregate will lose all the intermediate values, right?
If that is the case, is there a way to do a window aggregation that still supports incremental aggregation and uses a state backend to recover from a crash?
The AggregateFunction indeed only describes the mechanism for combining the input events into some result; that specific class does not store any data.
The state is persisted for us by Flink behind the scenes, though, when we write something like this:
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) part tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
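To make the division of labor concrete, here is a minimal plain-Java sketch, with no Flink dependency. The classes AverageAccumulator, AverageAggregateSketch and AverageAggregateDemo are hypothetical names for illustration only. The sketch mirrors the four methods of Flink's AggregateFunction (createAccumulator, add, merge, getResult) and treats the accumulator as the only thing a runtime would need to snapshot and restore. The "checkpoint" here is just a copied pair of numbers, which is an assumption standing in for whatever the state backend really does.

```java
import java.util.Arrays;
import java.util.List;

/** Accumulator for one key and window: the only data that needs to be persisted. */
final class AverageAccumulator {
    long sum;
    long count;
}

/** Stateless combining logic, shaped like Flink's AggregateFunction (hypothetical stand-in). */
final class AverageAggregateSketch {
    AverageAccumulator createAccumulator() {
        return new AverageAccumulator();
    }

    AverageAccumulator add(long value, AverageAccumulator acc) {
        acc.sum += value;
        acc.count += 1;
        return acc;
    }

    AverageAccumulator merge(AverageAccumulator a, AverageAccumulator b) {
        a.sum += b.sum;
        a.count += b.count;
        return a;
    }

    double getResult(AverageAccumulator acc) {
        return acc.count == 0 ? 0.0 : (double) acc.sum / acc.count;
    }
}

final class AverageAggregateDemo {
    /** Folds some events, snapshots the accumulator, "crashes", restores, and continues. */
    static double averageWithRestore(List<Long> beforeCrash, List<Long> afterRestore) {
        AverageAggregateSketch agg = new AverageAggregateSketch();
        AverageAccumulator acc = agg.createAccumulator();
        for (long v : beforeCrash) {
            agg.add(v, acc);
        }

        // Assumption: a checkpoint is modeled as a plain copy of the accumulator fields.
        long savedSum = acc.sum;
        long savedCount = acc.count;

        // After the simulated crash, the function object is new; only the saved state survives.
        AverageAggregateSketch fresh = new AverageAggregateSketch();
        AverageAccumulator restored = fresh.createAccumulator();
        restored.sum = savedSum;
        restored.count = savedCount;
        for (long v : afterRestore) {
            fresh.add(v, restored);
        }
        return fresh.getResult(restored);
    }

    public static void main(String[] args) {
        System.out.println(averageWithRestore(Arrays.asList(2L, 4L), Arrays.asList(6L)));
    }
}
```

The point of the sketch is that the aggregate function object itself can be thrown away and recreated, because the intermediate values live in the accumulator, which in a real job is held in keyed window state rather than in a local variable.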

About StateTtlConfig

I'm configuring a StateTtlConfig for MapState. My goal is for objects in the state to live for, say, 3 hours and then disappear from the state and be handed to the GC, so that memory is released; I think the checkpoints should shrink too. I had this configuration before, and it seemed not to work because the checkpoints were always growing:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3)).cleanupFullSnapshot().build();
Then I realized that this configuration only takes effect when reading state from a savepoint, which is not my scenario. I changed my TTL configuration to this one:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3))
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired).build();
Based on the idea that I want to clean all the states for all keys after a defined time.
My questions are:
Am I using the right configuration now?
What is the best way to do it?
Thanks one more time.
Kind regards!!!
I don't know enough about your use case to recommend a specific expiration/cleanup policy, but I can offer a few notes.
My understanding is that cleanupFullSnapshot() specifies that in addition to whatever other cleanup is being done, a full cleanup will be done whenever taking a snapshot.
The FsStateBackend uses the incremental cleanup strategy. By default it checks 5 entries during each state access, and does no additional cleanup during record processing. If your workload is such that there are many more writes than reads, that might not be enough. If no access happens to the state, expired state will persist. Choosing cleanupIncrementally(10, false) will make the cleanup more aggressive, assuming you do have some level of state access going on.
It's not unusual for checkpoint sizes to grow, or to take longer than you'd expect to reach a plateau. Could it simply be that the keyspace is growing?
https://flink.apache.org/2019/05/19/state-ttl.html is a good resource for learning more about Flink's State TTL mechanism.
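As a rough illustration of why expired entries can linger when nothing touches the state, here is a toy plain-Java model of incremental cleanup. It is not Flink's implementation: the class TtlMapSketch, its constructor parameters, and the decision to keep entries in last-write order are all assumptions made for the sketch. What it does model is the behavior described above: cleanup only runs as a side effect of a state access, inspects a bounded number of entries each time, and expired values are never returned to the caller.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of TTL state with incremental cleanup (hypothetical, not Flink code). */
final class TtlMapSketch {
    private final long ttlMillis;
    private final int cleanupSize; // how many entries are inspected per state access
    // key -> {value, lastWriteTime}; kept in last-write order so the oldest entries come first
    private final Map<String, long[]> entries = new LinkedHashMap<>();

    TtlMapSketch(long ttlMillis, int cleanupSize) {
        this.ttlMillis = ttlMillis;
        this.cleanupSize = cleanupSize;
    }

    void put(String key, long value, long now) {
        entries.remove(key); // re-insert so the entry moves to the end of the order
        entries.put(key, new long[] {value, now});
        cleanupSome(now);
    }

    /** Returns null for missing or expired entries, even if they are still stored. */
    Long get(String key, long now) {
        long[] e = entries.get(key);
        Long result = null;
        if (e != null && !isExpired(e, now)) {
            result = e[0];
        }
        cleanupSome(now);
        return result;
    }

    /** Number of entries physically held, including expired ones not yet cleaned up. */
    int size() {
        return entries.size();
    }

    private boolean isExpired(long[] e, long now) {
        return now - e[1] >= ttlMillis;
    }

    /** Inspects at most cleanupSize entries and drops the expired ones among them. */
    private void cleanupSome(long now) {
        Iterator<Map.Entry<String, long[]>> it = entries.entrySet().iterator();
        for (int i = 0; i < cleanupSize && it.hasNext(); i++) {
            if (isExpired(it.next().getValue(), now)) {
                it.remove();
            }
        }
    }
}
```

With a TTL of 100 ms and a cleanup size of 2, three entries written at time 0 are all still held at time 200 if nothing accesses the state. A single read at time 200 returns null for an expired key and removes two of the three entries, and a second access removes the last one. A larger cleanup size makes each access do more work, which is the trade-off behind a setting like cleanupIncrementally(10, false).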

Can we update a state's TTL value?

We have a topology that uses state (ValueState and ListState) with TTL (StateTtlConfig) because we cannot use timers. (We would generate hundreds of millions of timers per day, and that doesn't scale: a savepoint/checkpoint would take hours to generate and might even get stuck while running.)
However, we need to update the TTL value at runtime depending on the type of some incoming events and other logic. Is it all right to recreate the state with a new StateTtlConfig (and an updated TTL) and copy the values from "old" to "new" in the processElement1() and processElement2() methods of a CoProcessFunction (instead of once in open(), as we usually do)?
I guess the "old" state would be garbage collected (?).
Would this solution scale? be performant? generate any issue? anything bad?
I think your approach of re-creating the state at runtime can work to some extent, but it is brittle. The problem, as I see it, is that the old state's meta information can linger somewhere, depending on the backend implementation.
For Heap (FS) backend, eventually the checkpoint/savepoint will have no records for the expired old state but the meta info can linger in memory while the job is running. It will go away if the job is restarted.
For RocksDB, the column family of the old state can linger. Moreover, the background cleanup runs only during compaction. If the table is too small, like the part which is in memory, this part (maybe even a bit on disk) will linger. It will go away after restart if cleanup on full snapshot is active (not for incremental checkpoints).
All in all, it depends on how often you have to create the new state and restart your job from savepoint/checkpoint.
I created a ticket to document what can be changed in TTL config and when,
so check some details in the issue.
I guess the "old" state would be garbage collected (?).
from the Flink documentation Cleanup of Expired State.
By default, expired values are explicitly removed on read, such as
ValueState#value, and periodically garbage collected in the background
if supported by the configured state backend. Background cleanup can
be disabled in the StateTtlConfig:
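import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
// Expired values are then only removed when they are read
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.disableCleanupInBackground()
.build();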
or execute the clean up after a full snapshot:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.cleanupFullSnapshot()
.build();
You can change the TTL at any time according to the documentation. However, you have to restart the job (it is not done at runtime):
For existing jobs, this cleanup strategy can be activated or
deactivated anytime in StateTtlConfig, e.g. after restart from
savepoint.
But why don't you see the timers on RocksDB, like David said in the referenced answer?

Flink window operator checkpointing

I want to know how flink does the checkpoint of the window operator. How to ensure that it is exactly once when recovering? For example, saving the tuples in the current window and saving the progress of the current window processing. I want to know the detailed process of the window operator's checkpoint and recovery.
All of Flink's stateful operators participate in the same checkpointing mechanism. When instructed to do so by the checkpoint coordinator (part of the job manager), the task managers initiate a checkpoint in each parallel instance of every source operator. The sources checkpoint their offsets and insert a checkpoint barrier into the stream. This divides the stream into the parts before and after the checkpoint. The barriers flow through the graph, and each stateful operator checkpoints its state upon having processed the stream up to the checkpoint barrier. The details are described at the link shared by @bupt_ljy.
Thus these checkpoints capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
Given that during recovery the sources are rewound and replayed, "exactly once" means that the state managed by Flink is affected exactly once, not that the stream elements are processed exactly once.
There's nothing particularly special about windows in this regard. Depending on the type of window function being applied, a window's contents are kept in an element of managed ListState, ReducingState, AggregatingState, or FoldingState. As stream elements arrive and are assigned to a window, they are appended, reduced, aggregated, or folded into that state. Other components of the window API, including Triggers and ProcessWindowFunctions, can have state that is checkpointed as well. For example, CountTrigger uses ReducingState to keep track of how many elements have been assigned to the window, adding one to the count as each element is added to the window.
In the case where the window function is a ProcessWindowFunction, all of the elements assigned to the window are saved in Flink state, and are passed in an Iterable to the ProcessWindowFunction when the window is triggered. That function iterates over the contents and produces a result. The internal state of the ProcessWindowFunction is not checkpointed; if the job fails during the execution of the ProcessWindowFunction the job will resume from the most recently completed checkpoint. This will involve rewinding back to a time before the window received the event that triggered the window firing (that event can't be included in the checkpoint because a checkpoint barrier following it can not have had its effect yet). Sooner or later the window will again reach the point where it is triggered and the ProcessWindowFunction will be called again -- with the same window contents it received the first time -- and hopefully this time it won't fail. (Note that I've ignored the case of processing-time windows, which do not behave deterministically.)
When a ProcessWindowFunction uses managed/checkpointed state, it is used to remember things between firings, not within a single firing. For example, a window that allows late events might want to store the result previously reported, and then issue an update for each late event.
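The rewind-and-replay behavior described above can be sketched with a toy single-source, single-operator pipeline in plain Java. This is only a model under simplifying assumptions, not Flink's runtime: the class CheckpointReplaySketch and its run method are hypothetical, the "source offset" is a list index, the operator state is a running sum, and a checkpoint is taken synchronously every few elements instead of via barriers flowing through a graph.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of checkpoint, failure, rewind and replay (hypothetical, not Flink code). */
final class CheckpointReplaySketch {

    /** A completed checkpoint: source offset and operator state captured at the same point. */
    private static final class Checkpoint {
        final int offset;
        final long sum;

        Checkpoint(int offset, long sum) {
            this.offset = offset;
            this.sum = sum;
        }
    }

    /**
     * Sums the input, checkpointing after every `interval` elements, and fails once
     * just before processing index `failAt` (pass -1 for no failure).
     * Returns {final sum, number of times an element was processed}.
     */
    static long[] run(List<Long> input, int interval, int failAt) {
        Checkpoint last = new Checkpoint(0, 0L);
        int offset = 0;
        long sum = 0;
        long processed = 0;
        boolean failed = false;

        while (offset < input.size()) {
            if (!failed && offset == failAt) {
                failed = true;
                // Recovery: rewind the source and restore the state from the last checkpoint.
                offset = last.offset;
                sum = last.sum;
                continue;
            }
            sum += input.get(offset);
            offset++;
            processed++;
            if (offset % interval == 0) {
                last = new Checkpoint(offset, sum);
            }
        }
        return new long[] {sum, processed};
    }

    public static void main(String[] args) {
        long[] r = run(Arrays.asList(1L, 2L, 3L, 4L, 5L), 2, 3);
        System.out.println("sum=" + r[0] + ", processed=" + r[1]);
    }
}
```

For the input 1..5 with a checkpoint every 2 elements and a failure before index 3, the final sum is 15, the same as in a failure-free run, while the element at index 2 is processed twice (6 processing calls for 5 elements). That is the sense in which "exactly once" applies to the effect on state rather than to how many times each element is processed.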
