Restore MapState after Job restart/cancellation - apache-flink

I need to aggregate counts/sums over an event stream for various entities.
Event logs (JSON strings) are received from Kafka and populate a map whose key is the entity name and whose value is the count of selected attributes, stored as a JSON string.
MapState sourceAggregationMap = getRuntimeContext().getMapState(sourceAggregationDesc);
For each event in the stream, the value is updated.
The problem is that whenever the job is stopped (failed) or cancelled and then restarted, the map state is not reinitialized/restored, and the count starts from 0 again.
I am using Apache Flink 1.6.0 with:
state.backend: rocksdb

Checkpoints are used for automatic recovery from failures, and need to be explicitly enabled and configured. Savepoints are triggered manually and are used for restarts and upgrades. Both rely on the same snapshotting mechanism which is described in detail here.
These snapshots capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
With the RocksDB state backend, the working state is held on the local disk (in a location you configure), and checkpoints are durably persisted to a distributed file system (again, configurable). When a job is cancelled, the checkpoints are normally deleted (as they will no longer be needed for recovery), but they can be configured to be retained. If your jobs aren't recovering their state after failures, perhaps the checkpoints are failing, or the job is failing before the first checkpoint can complete. The web UI has a section that displays information about checkpoints, and the logs should also have helpful information.
Update: see also this answer.
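To make the recovery sequence concrete, here is a minimal, Flink-free sketch in plain Java. It is a toy model under stated assumptions, not Flink's API: the class CheckpointReplaySketch, the Snapshot type, and the run method are hypothetical names invented for illustration. It counts events per entity, takes a snapshot (source offset plus a copy of the map) every few events, and on a simulated failure rewinds to the last snapshot and replays from there.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of checkpoint-based recovery. Hypothetical names; this is not Flink's API. */
class CheckpointReplaySketch {

    /** A snapshot pairs the source offset with a copy of the state as of that offset. */
    static final class Snapshot {
        final int offset;
        final Map<String, Long> counts;

        Snapshot(int offset, Map<String, Long> counts) {
            this.offset = offset;
            this.counts = new HashMap<>(counts); // copy, so later updates do not leak in
        }
    }

    /**
     * Counts events per entity, snapshotting after every checkpointInterval events.
     * Simulates a single crash just before processing the event at index failAt
     * (pass a negative value for a run without failures).
     */
    static Map<String, Long> run(List<String> events, int checkpointInterval, int failAt) {
        Map<String, Long> counts = new HashMap<>();
        Snapshot last = new Snapshot(0, counts); // "no checkpoint yet": empty state, offset 0
        boolean failed = false;
        int offset = 0;
        while (offset < events.size()) {
            if (!failed && offset == failAt) {
                failed = true;
                counts = new HashMap<>(last.counts); // restore the state
                offset = last.offset;                // rewind the source
                continue;                            // resume processing
            }
            counts.merge(events.get(offset), 1L, Long::sum);
            offset++;
            if (offset % checkpointInterval == 0) {
                last = new Snapshot(offset, counts);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> events = Arrays.asList("a", "b", "a", "c", "a", "b", "c");
        System.out.println("with failure:    " + run(events, 3, 5));
        System.out.println("without failure: " + run(events, 3, -1));
    }
}
```

Both runs end with the same counts, because the offset and the map are captured together. If the failure happens before the first snapshot, the sketch falls back to empty state at offset 0 and replays everything. Counts only restart from 0 without a replay when no snapshot is used for the restart at all, which is what happens when a cancelled job is resubmitted without pointing it at a retained checkpoint or savepoint.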

Related

Apache Flink: restoring state from checkpoint with changed Kafka topic

I ran into unexpected behavior when I needed to start a job from a checkpoint and change the Kafka topic. In this case Flink restores the Kafka consumer's state with the previously defined topic, the last committed offset, and the consumer group id. As a result, the Kafka consumer starts consuming messages from two topics: the former one, which was restored from the state, and the new one, defined in the configuration at the start of the job.
It's very confusing, and in the end it's not entirely clear whether this is a bug or a feature. Is there a way to recover a job from a checkpoint without restoring the state of the Kafka consumers, and instead use the parameters from the configuration to initialize them?
I need the previous job state, but I want to get new data from another topic!
If you change the UID of the KafkaSource (or FlinkKafkaConsumer) and restart the job with allowNonRestoredState enabled, then you'll get the behavior you are looking for.
Changing the UID (or setting one, if you haven't explicitly set one) will prevent the saved Kafka offsets from being restored, and allowNonRestoredState will override Flink's built-in protections against losing state.
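The reason this works can be sketched with a small, Flink-free toy model in plain Java. Everything here is hypothetical and for illustration only (the class UidRestoreSketch, the restore method, and the UID strings are not Flink APIs). It assumes snapshot state is matched to operators by UID, and that state with no matching operator causes the restore to fail unless non-restored state is explicitly allowed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of matching snapshot state to operators by UID. Not Flink's API. */
class UidRestoreSketch {

    /**
     * Returns the state handed to each operator of the new job, keyed by operator UID.
     * State whose UID matches no operator is dropped if allowNonRestoredState is true,
     * and otherwise makes the restore fail.
     */
    static Map<String, String> restore(Map<String, String> snapshotByUid,
                                       Set<String> jobOperatorUids,
                                       boolean allowNonRestoredState) {
        Map<String, String> restored = new HashMap<>();
        for (Map.Entry<String, String> entry : snapshotByUid.entrySet()) {
            if (jobOperatorUids.contains(entry.getKey())) {
                restored.put(entry.getKey(), entry.getValue());
            } else if (!allowNonRestoredState) {
                throw new IllegalStateException(
                        "Snapshot has state for uid '" + entry.getKey()
                                + "' but the job has no such operator");
            }
            // otherwise: the orphaned state is silently skipped
        }
        return restored;
    }

    public static void main(String[] args) {
        Map<String, String> snapshot = new HashMap<>();
        snapshot.put("kafka-source-v1", "old-topic offsets"); // hypothetical UIDs and payloads
        snapshot.put("aggregator", "per-entity counts");

        // The new job keeps the aggregator's UID but gives the source a new one.
        Set<String> newJob = new HashSet<>(Arrays.asList("kafka-source-v2", "aggregator"));

        System.out.println(restore(snapshot, newJob, true));
    }
}
```

In this model the aggregator gets its previous state back, while the source with the new UID receives nothing from the snapshot and therefore starts from whatever its configuration says. Calling restore with allowNonRestoredState set to false throws instead, which mirrors the built-in protection mentioned above.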

Apache Flink Checkpointing (Manually put a value into RocksDB Checkpoint and retrieve during recovery or Restart)

We have a scenario where we have to persist/save some value into the checkpoint and retrieve it back during failure recovery/application restart.
We tried a few things, such as ValueState with a ValueStateDescriptor, following the examples below, but it is still not working.
https://github.com/realtime-storage-engine/flink-spillable-statebackend/blob/master/flink-spillable-benchmark/src/main/java/org/apache/flink/spillable/benchmark/WordCount.java
https://towardsdatascience.com/heres-how-flink-stores-your-state-7b37fbb60e1a
https://github.com/king/flink-state-cache/blob/master/examples/src/main/java/com/king/flink/state/Example.java
We can't externalize it to a DB as it may cause some performance issues.
Any pointers on doing this with checkpoints would be helpful. How do we put a value into a checkpoint and get it back?
All of your managed application state is automatically written into Flink checkpoints (and savepoints). This includes:
keyed state (ValueState, ListState, MapState, etc)
operator state (ListState, BroadcastState, etc)
timers
This state is automatically restored during recovery, and can optionally be restored during manual restarts.
The Flink Operations Playground shows how to work with checkpoints and savepoints, and lets you observe their behavior during failure/recovery and restarts/rescaling.
If you want to read from a checkpoint yourself, that's what the State Processor API is for. Here's an example.

Flink - Lazy start with operators working during savepoint startup

I am using Apache Flink with RocksDBStateBackend and running into trouble when the job is restarted from a savepoint.
Apparently, it takes some time for the state to be ready again, but even though the state isn't ready yet, the DataStreams from Kafka seem to be moving data around, which causes some invalid misses in my KeyedProcessFunction.
Is it the expected behavior? I couldn't find anything in the documentation, and apparently, no related configuration.
The ideal for us would be to have the state fully ready to be queried before any data is moved.
For example, this shows that during a deployment, the estimate_num_keys metric was slowly increasing.
However, if we look at an application counter from an operator, the operators were already working during that "warm-up phase".
I found some discussion here Apache flink: Lazy load from save point for RocksDB backend where it was suggested to use Externalized Checkpoints.
I will look into it, but currently, our state isn't too big (~150 GB), so I am not sure if that is the only path to try.
Starting a Flink job that uses RocksDB from a savepoint is an expensive operation, as all of the state must first be loaded from the savepoint into new RocksDB instances. On the other hand, if you use a retained, incremental checkpoint, then the SST files in that checkpoint can be used directly by RocksDB, leading to much faster start-up times.
But, while it's normal for starting from a savepoint to be expensive, this shouldn't lead to any errors or dropped data.

Is state saved in TaskManager's memory regardless of state back end?

I know I can set the state backend either globally in Flink's configuration file (flink-conf.yaml)
or per job:
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
I have one question here:
Where is the state that belongs to a TaskManager kept while the Flink job is running? I understand that when a checkpoint completes, the checkpointed data is saved in HDFS (chk-XXX) or RocksDB. But while the job keeps running, the TaskManager accumulates more and more state. Is that state always kept in memory?
If it is kept in memory, then the checkpoint data can't be too large, or else an OOM may occur.
Can I use RocksDB inside the TaskManager process to hold the TaskManager's state? Thanks!
With the FsStateBackend, the working state for each task manager is in memory (on the JVM heap), and state backups (checkpoints) go to a distributed file system, e.g., HDFS.
With the RocksDBStateBackend, the working state for each task manager is in a local RocksDB instance, i.e., on the local disk, and again, the state backups (checkpoints) go to a distributed file system, e.g., HDFS.
Flink never stores checkpoint data in RocksDB. That's not the role it plays. RocksDB is used as an ephemeral, embedded data store whose contents can be lost in the event that a task manager fails. This is an alternative to keeping the working state in memory (where it can also be lost when a task manager fails).

How does Flink make checkpoint asynchronously with RocksDB backend

I am using Flink with RocksDB. From the Flink documentation I understand that Flink checkpoints asynchronously when using the RocksDB backend. See this description in the docs:
It is possible to let an operator continue processing while it stores its state snapshot, effectively letting the state snapshots happen asynchronously in the background. To do that, the operator must be able to produce a state object that should be stored in a way such that further modifications to the operator state do not affect that state object. For example, copy-on-write data structures, such as are used in RocksDB, have this behavior.
From my understanding, when a checkpoint needs to be made, an operator performs these steps for RocksDB:
Flush data in memtable
Copy the db folder into another tmp folder, which contains all the data in RocksDB
Upload the copied data to remote Fs-system. (In this step, it is asynchronous)
Is my understanding right? Or could anyone help to illustrate the details?
Thanks a lot, because I cannot find any documentation that describes the details.
I found a blog post that describes the process:
To do this, Flink triggers a flush in RocksDB, forcing all memtables into sstables on disk, and hard-linked in a local temporary directory. This process is synchronous to the processing pipeline, and Flink performs all further steps asynchronously and does not block processing.
See the link for more details: Incremental Checkpoint
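The hard-link step in the quote is what keeps the synchronous part cheap: no data is copied. As a rough illustration (a plain-Java sketch using only java.nio.file, not Flink or RocksDB code; the class name, file names, and file contents are made up), the snippet below hard-links a stand-in "sstable" file into a snapshot directory, then deletes the original and writes a new file, as a compaction might. It assumes the local file system supports hard links.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrates why a hard-linked file is a stable snapshot. Not Flink or RocksDB code. */
class HardLinkSnapshotSketch {

    /** Returns what the snapshot directory sees after the live directory has moved on. */
    static String snapshotSurvivesCompaction() {
        try {
            // Both directories share one parent so they are on the same file system,
            // which hard links require.
            Path root = Files.createTempDirectory("hardlink-sketch");
            Path dbDir = Files.createDirectory(root.resolve("db"));
            Path snapshotDir = Files.createDirectory(root.resolve("chk-1"));

            // Stand-in for an sstable that a flush has just written; it is never modified.
            Path sst = dbDir.resolve("000001.sst");
            Files.write(sst, "entityA=3".getBytes(StandardCharsets.UTF_8));

            // "Snapshot": a second directory entry for the same file, no bytes copied.
            Path linked = snapshotDir.resolve("000001.sst");
            Files.createLink(linked, sst);

            // Processing continues: the old file is removed and a newer one is written.
            Files.delete(sst);
            Files.write(dbDir.resolve("000002.sst"), "entityA=4".getBytes(StandardCharsets.UTF_8));

            // The snapshot still holds the old contents and can be uploaded at leisure.
            return new String(Files.readAllBytes(linked), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(snapshotSurvivesCompaction());
    }
}
```

Because the files are written once and never modified in place, the linked directory stays a consistent view of the state as of the flush, so the upload to the remote file system can run in the background while the operator keeps processing. This also differs from step 2 in the question: the local directory is linked rather than copied.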
