Flink checkpointing working for ProcessFunction but not for AsyncFunction - apache-flink

I have operator checkpointing enabled and working smoothly for a ProcessFunction operator.
On job failure I can see how operator state gets externalized on the snapshotState() hook, and on resume, I can see how state is restored at the initializeState() hook.
However, when I implement the CheckpointedFunction interface and the two aforementioned methods on an AsyncFunction, it does not seem to work. I'm doing virtually the same thing as with the ProcessFunction, but when the job is shutting down after a failure it never seems to stop at the snapshotState() hook, and upon job resume, context.isRestored() is always false.
Why are CheckpointedFunction.snapshotState() and CheckpointedFunction.initializeState() executed for a ProcessFunction but not for an AsyncFunction?
Edited:
For some reason, my checkpoints are taking very long. My config is quite standard, I believe: a 1-second interval, a 500 ms minimum pause, exactly-once mode. No other tuning.
I'm getting these traces from the checkpoint coordinator:
o.a.f.s.r.t.SubtaskCheckpointCoordinatorImpl - Time from receiving all checkpoint barriers/RPC to executing it exceeded threshold: 93905ms
2021-11-23 16:25:01 INFO o.a.f.r.c.CheckpointCoordinator - Completed checkpoint 4 for job 239d7967eac7900b33d7eadd483c9447 (671604 bytes in 112071 ms).
If I set a checkpointTimeout, I need something on the order of 5 minutes or so. How can a checkpoint of such a small state (it's just a Counter and a Long) take 5 minutes?
I've also read that NFS volumes are a recipe for trouble, but so far I haven't run this on the cluster; I'm just testing it on my local filesystem.
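For reference, the checkpointing setup described here corresponds roughly to the following sketch (standard DataStream API calls with the values from the question, not the actual job code):
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// The values from the question: 1 s interval, exactly-once mode, 500 ms minimum pause.
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);

// Raising the timeout (e.g. to 5 minutes) only hides the problem; a checkpoint
// of a Counter and a Long should normally complete in milliseconds.
env.getCheckpointConfig().setCheckpointTimeout(5 * 60 * 1000);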

AsyncFunction doesn't support state at all. The reason is that the state primitives are not synchronized and would therefore produce incorrect results in an AsyncFunction. That's the same reason why there is no KeyedAsyncFunction.
If Flink had https://cwiki.apache.org/confluence/display/FLINK/FLIP-22%3A+Eager+State+Declaration implemented, it could simply attach the state to each async call and update it when the async call completes successfully.
You can work around the limitation with some trickery involving chained maps and slot sharing groups, but it's rather hacky; a sketch of that idea follows.
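A minimal sketch of that workaround, assuming the async I/O itself stays stateless and the state you wanted (here just a counter) moves into a plain chained map. CountingMap and MyAsyncFunction are made-up names for illustration:
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

// Stateful map chained in front of the (stateless) AsyncFunction.
public class CountingMap extends RichMapFunction<String, String>
        implements CheckpointedFunction {

    private transient ListState<Long> checkpointedCount;
    private long count;

    @Override
    public String map(String value) {
        count++; // the state lives here, not in the AsyncFunction
        return value;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedCount = ctx.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("count", Long.class));
        if (ctx.isRestored()) {
            for (Long c : checkpointedCount.get()) {
                count += c;
            }
        }
    }
}

// Wiring (MyAsyncFunction is assumed and stays stateless):
// DataStream<String> counted  = input.map(new CountingMap());
// DataStream<String> enriched = AsyncDataStream
//         .unorderedWait(counted, new MyAsyncFunction(), 30, TimeUnit.SECONDS);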

Related

Can we update a state's TTL value?

We have a topology that uses state (ValueState and ListState) with a TTL (StateTtlConfig) because we cannot use Timers (we would generate hundreds of millions of timers per day, and that does not scale: a savepoint/checkpoint would take hours to generate and might even get stuck while running).
However, we need to update the TTL value at runtime depending on the type of some incoming events and other logic. Is it alright to create a new state with a new StateTtlConfig (and an updated TTL) and copy the values from the "old" state to the "new" one in the processElement1() and processElement2() methods of a CoProcessFunction (instead of doing it once in open() like we usually do)?
I guess the "old" state would be garbage collected (?).
Would this solution scale? Would it be performant? Would it cause any issues? Anything bad?
I think your approach of re-creating the state at runtime can work to some extent, but it is brittle. The problem I can see is that the old state's meta information can linger somewhere, depending on the backend implementation.
For the heap (FS) backend, the checkpoint/savepoint will eventually have no records for the expired old state, but the meta info can linger in memory while the job is running. It will go away if the job is restarted.
For RocksDB, the column family of the old state can linger. Moreover, the background cleanup runs only during compaction, so if the table is too small, the part that is in memory (and maybe even a bit on disk) will linger. It will go away after a restart if cleanup on full snapshot is active (this does not apply to incremental checkpoints).
All in all, it depends on how often you have to create the new state and restart your job from savepoint/checkpoint.
I created a ticket to document what can be changed in the TTL config and when, so check the issue for details.
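For illustration, a minimal sketch of the re-creation approach being discussed, written as a helper method inside a keyed (Co)ProcessFunction. The versioned state name and the Long value type are assumptions, and it reuses the StateTtlConfig/Time imports shown further down plus ValueState/ValueStateDescriptor:
// Swap to a descriptor with a new TTL from inside processElement();
// "my-state-v" + version is a made-up naming scheme, and oldState is the
// previously registered ValueState field.
private ValueState<Long> migrateTtl(long newTtlMillis, int version) throws Exception {
    StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.milliseconds(newTtlMillis))
        .cleanupFullSnapshot()
        .build();

    ValueStateDescriptor<Long> newDescriptor =
        new ValueStateDescriptor<>("my-state-v" + version, Long.class);
    newDescriptor.enableTimeToLive(ttlConfig);

    ValueState<Long> newState = getRuntimeContext().getState(newDescriptor);
    Long oldValue = oldState.value();
    if (oldValue != null) {
        newState.update(oldValue);  // copy "old" -> "new"
        oldState.clear();           // let the old entry be cleaned up eventually
    }
    return newState;
}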
I guess the "old" state would be garbage collected (?).
From the Flink documentation on Cleanup of Expired State:
By default, expired values are explicitly removed on read, such as
ValueState#value, and periodically garbage collected in the background
if supported by the configured state backend. Background cleanup can
be disabled in the StateTtlConfig:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .disableCleanupInBackground()
    .build();
Or execute the cleanup when taking a full snapshot:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .cleanupFullSnapshot()
    .build();
You can change the TTL at any time according to the documentation. However, you have to restart the job (it is not applied at runtime):
For existing jobs, this cleanup strategy can be activated or
deactivated anytime in StateTtlConfig, e.g. after restart from
savepoint.
But why don't you keep the timers in RocksDB, as David suggested in the referenced answer?

Flink window operator checkpointing

I want to know how Flink checkpoints the window operator and how it ensures exactly-once semantics on recovery, for example how it saves the tuples in the current window and the progress of the current window's processing. I want to understand the detailed process of the window operator's checkpointing and recovery.
All of Flink's stateful operators participate in the same checkpointing mechanism. When instructed to do so by the checkpoint coordinator (part of the job manager), the task managers initiate a checkpoint in each parallel instance of every source operator. The sources checkpoint their offsets and insert a checkpoint barrier into the stream. This divides the stream into the parts before and after the checkpoint. The barriers flow through the graph, and each stateful operator checkpoints its state upon having processed the stream up to the checkpoint barrier. The details are described at the link shared by @bupt_ljy.
Thus these checkpoints capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
Given that during recovery the sources are rewound and replayed, "exactly once" means that the state managed by Flink is affected exactly once, not that the stream elements are processed exactly once.
There's nothing particularly special about windows in this regard. Depending on the type of window function being applied, a window's contents are kept in an element of managed ListState, ReducingState, AggregatingState, or FoldingState. As stream elements arrive and are assigned to a window, they are appended, reduced, aggregated, or folded into that state. Other components of the window API, including Triggers and ProcessWindowFunctions, can have state that is checkpointed as well. For example, CountTrigger uses ReducingState to keep track of how many elements have been assigned to the window, adding one to the count as each element is added.
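To make this concrete, here is a small sketch (made-up field names, events assumed to be a DataStream of Tuple2<String, Long>): an incremental reduce() only ever keeps one aggregated value per key and window in ReducingState, whereas process() keeps every assigned element in ListState until the window fires.
// Time here is org.apache.flink.streaming.api.windowing.time.Time.
DataStream<Tuple2<String, Long>> counts = events
    .keyBy(t -> t.f0)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .reduce((a, b) -> new Tuple2<>(a.f0, a.f1 + b.f1));   // backed by ReducingState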
In the case where the window function is a ProcessWindowFunction, all of the elements assigned to the window are saved in Flink state and passed in an Iterable to the ProcessWindowFunction when the window is triggered. That function iterates over the contents and produces a result. The internal state of the ProcessWindowFunction is not checkpointed; if the job fails during the execution of the ProcessWindowFunction, the job will resume from the most recently completed checkpoint. This involves rewinding back to a time before the window received the event that triggered the firing (that event can't be included in the checkpoint, because a checkpoint barrier that followed it cannot have taken effect yet). Sooner or later the window will again reach the point where it is triggered, and the ProcessWindowFunction will be called again -- with the same window contents it received the first time -- and hopefully this time it won't fail. (Note that I've ignored the case of processing-time windows, which do not behave deterministically.)
When a ProcessWindowFunction uses managed/checkpointed state, that state is used to remember things between firings, not within a single firing. For example, a window that allows late events might want to store the previously reported result and then issue an update for each late event, as sketched below.
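A sketch of that pattern (made-up types: Long inputs keyed by String, summed per TimeWindow), using per-window state from the Context to remember the result reported at the first firing:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class UpdatingSumFunction
        extends ProcessWindowFunction<Long, String, String, TimeWindow> {

    private final ValueStateDescriptor<Long> previousResult =
        new ValueStateDescriptor<>("previousResult", Long.class);

    @Override
    public void process(String key, Context ctx, Iterable<Long> elements,
                        Collector<String> out) throws Exception {
        long sum = 0;
        for (Long e : elements) {
            sum += e;
        }
        // windowState() is scoped to this key and this window, and it is part of
        // the checkpointed state, so it survives between the first firing and
        // later firings for late events.
        ValueState<Long> previous = ctx.windowState().getState(previousResult);
        Long reportedBefore = previous.value();
        if (reportedBefore == null) {
            out.collect(key + ": " + sum);
        } else if (sum != reportedBefore) {
            out.collect(key + ": updated " + reportedBefore + " -> " + sum);
        }
        previous.update(sum);
    }

    @Override
    public void clear(Context ctx) throws Exception {
        // Per-window state must be cleaned up when the window is purged.
        ctx.windowState().getState(previousResult).clear();
    }
}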

Managed Keyed State not restoring with checkpointing enabled

EDIT 2: skip to the end; the state is restored, but it's not queryable. New tl;dr: "How do I make state that was queryable still queryable after a restore from a checkpoint?"
I have a keyed stream with checkpointing enabled, similar to this (I've tried this with the in-memory backend as well as HDFS, with the same results):
env.enableCheckpointing(60000)
env.setStateBackend(new FsStateBackend("file:///flink-test"))
val stream = env.addSource(consumer)
.flatMap(new ValidationMap()).name("ValidationMap")
.keyBy(x => new Tuple3[String, String, String](x.account(), x.organization(), x.`type`()))
.flatMap(new Foo()).name(jobname)
Within this stream, I have a managed keyed ValueState that I set as queryable:
val newValueStateDescriptor = new ValueStateDescriptor[java.util.ArrayList[java.util.ArrayList[Long]]]("foo", classOf[java.util.ArrayList[java.util.ArrayList[Long]]])
newValueStateDescriptor.setQueryable("foo")
valueState = getRuntimeContext.getState[java.util.ArrayList[java.util.ArrayList[Long]]](newValueStateDescriptor)
valueState.update(new java.util.ArrayList[java.util.ArrayList[Long]]())
This list is periodically appended to or removed from, and the valueState is updated. When I make a request to the Queryable State, I currently see correct values.
In my JobManager log I see checkpointing every minute, and when I check the filesystem, I see non-empty files being created.
My setup has 3 JobManagers (2 on standby) and 3 TaskManagers (all 3 in use).
I put a single data point into the system and read it out of Queryable State; everything looks good. Then I pick a single TaskManager (not even the one that processed the data, any of the 3), kill it, and restart it to simulate a crash.
I watch the job get retried 2 or 3 times until the TaskManager comes back online, and finally I see the same JobID running again in Flink; life seems good.
But then I hit the Queryable State again, and I get an UnknownKvStateLocation exception.
I'm really not quite sure what I've done wrong here. Things appear to be checkpointing, but I never manage to get my ValueState back. Or maybe it's back but just not queryable?
EDIT:
A log snippet from the JobManager implies things are restored:
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state RESTARTING to CREATED."}
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Recovering checkpoints from ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Found 1 checkpoints in ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Trying to retrieve checkpoint 5."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","ndc":"", "message":"Restoring from latest valid checkpoint: Checkpoint 5 # 1496330912627 for dc7850a6866f181c2f07968d35fe3d46."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state CREATED to RUNNING."}
It really looks like it's restored, and when I inspect the file created in /flink-test I see some binary data that contains the identifying names of my queryable ValueState. Any ideas on what to look for would be welcome.
EDIT 2: the state >is< restored, it's just not queryable!
The fact that a given piece of registered state has been made queryable is not (currently) part of what Flink records in checkpoints or savepoints. So after recovery, the state isn't queryable until a new StateDescriptor is provided.
For more, see this discussion on the flink-users mailing list.
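For reference, a sketch of the query side using the QueryableStateClient API from more recent Flink releases (the question dates from an older release whose client API differed); a String key and Long value are chosen for brevity, and the proxy host and port are placeholders:
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;

// 9069 is the default port of the queryable state proxy.
QueryableStateClient client = new QueryableStateClient("proxy-host", 9069);

ValueStateDescriptor<Long> descriptor =
    new ValueStateDescriptor<>("foo", Long.class);

CompletableFuture<ValueState<Long>> future = client.getKvState(
    JobID.fromHexString("dc7850a6866f181c2f07968d35fe3d46"),
    "foo",                              // the name passed to setQueryable("foo")
    "some-key",
    BasicTypeInfo.STRING_TYPE_INFO,
    descriptor);

Long value = future.join().value();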

Libev: how to schedule a callback to be called as soon as possible

I'm learning libev and I've stumbled upon this question. Assume that I want to process something as soon as possible, but not right now (i.e. not in the currently executing function). For example, I want to divide some big synchronous job into multiple pieces that will be queued so that other callbacks can fire in between. In other words, I want to schedule a callback with a timeout of 0.
So the first idea is to use an ev_timer with a timeout of 0. The first question is: is that efficient? Is libev capable of transforming a 0-timeout timer into an efficient "call as soon as possible" job? I assume it is not.
I've been digging through libev's docs and I found other options as well:
it can artificially delay invoking the callback by using a prepare or idle watcher
So the idle watcher is probably not going to work here, because
Idle watchers trigger events when no other events of the same or higher priority are pending
which is probably not what I want. Prepare watchers might work. But why not a check watcher? Is there any crucial difference in the context I'm talking about?
The other option these docs suggest is:
or more sneakily, by reusing an existing (stopped) watcher and pushing it into the pending queue:
ev_set_cb (watcher, callback);
ev_feed_event (EV_A_ watcher, 0);
But that would require always having a stopped watcher around. Also, since I don't know a priori how many calls I will want to schedule at the same time, I would need multiple watchers and would additionally have to keep track of them in some kind of list, growing it when needed.
So am I on the right track? Are these all possibilities or am I missing something simple?
You may want to check out the ev_prepare watcher. It is scheduled for execution as the last handler in a given event loop iteration, and it can be used for "execute this task ASAP" implementations. You can create a dedicated watcher for each task you want to execute, or you can implement a queue with a single prepare watcher that is started once the queue contains at least one task.
Alternatively, you can implement a similar mechanism using an ev_idle watcher, but in that case it will be executed only if the application isn't processing any 'higher priority' watcher handlers.

libspotify playlist update latency

We're using libspotify to update playlists that we have generated against a single account and that need to be kept up to date over time. We're using a fork of spotify-api-server to do this: https://github.com/tom-martin/spotify-api-server
After sending an update to a playlist's tracks using libspotify, we generally wait for the callback we passed to sp_playlist_add_callbacks to be called before we report success to the user. Often this callback arrives within a suitable time frame, but increasingly we're seeing unacceptable delays before receiving it: sometimes 30 seconds, sometimes minutes, sometimes hours. It seems that these delays are generally caused by libspotify pausing for a period and not calling any callbacks until it "unfreezes" and delivers all the backed-up callbacks in quick succession.
Is it reasonable to use this callback as an indicator of a successful playlist update? Is there any obvious reason for these long delays?
Are you correctly handling the notify_main_thread function to keep libSpotify running?
Also, sometimes the playlist system gets backed up, goes down, or otherwise takes a while to respond to requests. Our own clients keep their own cache of what the playlist tree should look like once pending transactions succeed, to keep the UI snappy.
