Flink local recovery state file clear - apache-flink

We are testing Flink's local-recovery option to achieve fast recovery for our large keyed state. We cancelled the running job and then restarted it from the last checkpoint, and we found that the previous state remained in the file system. We want to ask whether those state files are ever deleted once the job has been resumed from them. We don't want our local task disks' usage to grow without bound.

The state files will not be deleted, because a new job ID is assigned to the resumed job, so Flink creates a new directory to store its checkpoint files. That makes sense to me: suppose Flink did delete the old state files after recovery; what would you do if the program failed again?
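For reference, local recovery also has to be switched on explicitly (it is off by default), and it only speeds up restores of checkpoints that are already being taken. A minimal sketch, assuming the cluster-level keys are set in flink-conf.yaml (state.backend.local-recovery and taskmanager.state.local.root-dirs are the relevant options; the path and interval below are only examples):
// In flink-conf.yaml, picked up by each task manager:
//   state.backend.local-recovery: true
//   taskmanager.state.local.root-dirs: /data/flink/local-state    (example path)
// In the job itself, checkpointing still needs to be enabled:
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000L);  // example: checkpoint every 60 seconds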

Related

Flink-RocksDB behaviour after task manager failure

I am experimenting with my new Flink cluster (3 machines: 1 job manager, 2 task managers) using RocksDB as the state backend, but the checkpointing behaviour I am seeing is a little confusing.
More specifically, I have written a simple WordCount example with netcat as the data source. When I submit the job, the job manager assigns it to one of the task managers (no replication). I feed in some words and then kill the currently running task manager. After a while, the job restarts on the other task manager and I can feed in new words. The confusing part is that the state from the first task manager is preserved even though I killed it.
To my understanding, RocksDB keeps its state in a local directory on the running task manager, so I expected that killing the first task manager would lose the entire state and counting would start from the beginning. Does Flink somehow keep the state in memory, or broadcast it through the JobManager?
Am I missing something?
The RocksDB state backend does keep its working state on each task manager's local disk, while checkpoints are normally stored in a distributed filesystem.
If you have checkpointing enabled, then the spare task manager is able to recover the state from the latest checkpoint and resume processing.
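To make that concrete: in a WordCount-style job the running counts live in keyed state that is managed by the state backend and snapshotted into each checkpoint, and it is that snapshot which is reloaded when the job is rescheduled onto the surviving task manager. A minimal sketch of such a keyed counter in Java, with illustrative names (not code from the original question):
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Use downstream of keyBy(word -> word); each key gets its own count.
public class CountPerWord extends RichFlatMapFunction<String, Tuple2<String, Long>> {
    private transient ValueState<Long> count;  // per-key count, held by the state backend

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Types.LONG));
    }

    @Override
    public void flatMap(String word, Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = count.value();  // after a failure, this value comes from the restored checkpoint
        long updated = (current == null ? 0L : current) + 1;
        count.update(updated);
        out.collect(Tuple2.of(word, updated));
    }
}
So the counts survive not because the killed task manager's disk is read again, but because the latest checkpoint in the distributed filesystem is.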

Flink, what does method setDbStoragePath do in RocksDBStateBackend?

I'm using Flink 1.11 with the RocksDBStateBackend; the code looks like this:
RocksDBStateBackend stateBackend = new RocksDBStateBackend("hdfs:///flink-checkpoints", true);
stateBackend.setDbStoragePath(config.getString("/tmp/rocksdb/"));
env.setStateBackend(stateBackend);
My questions are:
My understanding is that when DbStoragePath is set, Flink puts all checkpoints and state on a local disk (in my case /tmp/rocksdb) before storing them in HDFS under hdfs:///flink-checkpoints. Is that right? And if so, should I always set DbStoragePath for better performance?
Because Flink doesn't delete old checkpoints, I have a job that periodically cleans them up. But I'm not sure whether that is safe to do if I enable incremental checkpoints.
The DbStoragePath is the location on the local disk where RocksDB keeps its working state. By default the tmp directory is used. Ideally this should be the fastest available disk, e.g., an SSD. Normally this is configured via state.backend.rocksdb.localdir.
If you are using incremental checkpoints, then the SST files from the DbStoragePath are copied to the state.checkpoints.dir. Otherwise full snapshots are written to the checkpoint directory and the DbStoragePath isn't involved.
Flink automatically deletes old checkpoints, except after canceling a job that is using retained checkpoints. It's not obvious how to safely delete an incremental, retained checkpoint -- you need to somehow know if any of those SST files are still referred to from the latest checkpoint. You might ask for advice on the user mailing list.
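Putting that together, a minimal sketch of the setup being discussed, assuming Flink 1.11 as in the question (the local path is only an example; the constructor's boolean switches on incremental checkpoints):
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Checkpoints (full or incremental) are written to the distributed filesystem:
RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink-checkpoints", true);  // true = incremental
// Working state (the live RocksDB instances) stays on a fast local disk:
backend.setDbStoragePath("/mnt/ssd/flink/rocksdb");  // example path, ideally an SSD
env.setStateBackend(backend);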

Is state saved in TaskManager's memory regardless of state back end?

I know I can set the state backend either globally in Flink's configuration file (flink-conf.yaml) or in the per-job scope:
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
I have one question here:
Where is the state that belongs to a TaskManager kept while the Flink job is running? I understand that when a checkpoint completes, the checkpointed data is saved to HDFS (chk-XXX) or RocksDB, but while the job keeps running the TaskManager accumulates more and more state; is that state always kept in memory?
If it is kept in memory, then the state can't be too large, or an OOM may occur.
Can I use RocksDB inside the TaskManager process to hold the TM's state data? Thanks!
With the FsStateBackend, the working state for each task manager is in memory (on the JVM heap), and state backups (checkpoints) go to a distributed file system, e.g., HDFS.
With the RocksDBStateBackend, the working state for each task manager is in a local RocksDB instance, i.e., on the local disk, and again, the state backups (checkpoints) go to a distributed file system, e.g., HDFS.
Flink never stores checkpoint data in RocksDB. That's not the role it plays. RocksDB is used as an ephemeral, embedded data store whose contents can be lost in the event that a task manager fails. This is an alternative to keeping the working state in memory (where it can also be lost when a task manager fails).
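In Java, switching the same job between the two options is just a matter of which backend you hand to the environment. A sketch using the checkpoint URI from the question (pick one or the other):
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;

// Working state on the JVM heap, checkpoint backups in HDFS:
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"));

// Working state in an embedded RocksDB instance on local disk, checkpoint backups in HDFS
// (requires the flink-statebackend-rocksdb dependency):
env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:40010/flink/checkpoints"));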

Restore MapState after Job restart/cancellation

I have to aggregate counts/sums over an event stream for various entities.
Event logs (JSON strings) are received from Kafka and populate a map with the entity name as the key and, as the value, the count of the selected attributes as a JSON string.
MapState sourceAggregationMap = getRuntimeContext().getMapState(sourceAggregationDesc);
For each event in the stream the value is repopulated.
The problem is that whenever the job stops (fails) or is cancelled and then gets restarted, the map state is not reinitialized / restored; the count starts from 0 again.
Using Apache Flink 1.6.0
state.backend: rocksdb
Checkpoints are used for automatic recovery from failures, and need to be explicitly enabled and configured. Savepoints are triggered manually and are used for restarts and upgrades. Both rely on the same snapshotting mechanism which is described in detail here.
These snapshots capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
With the RocksDB state backend, the working state is held on the local disk (in a location you configure), and checkpoints are durably persisted to a distributed file system (again, configurable). When a job is cancelled, the checkpoints are normally deleted (as they will no longer be needed for recovery), but they can be configured to be retained. If your jobs aren't recovering their state after failures, perhaps the checkpoints are failing, or the job is failing before the first checkpoint can complete. The web ui has a section that displays information about checkpoints, and the logs should also have helpful information.
Update: see also this answer.
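If checkpointing was never enabled in the first place (the simplest explanation the answer points at), a minimal sketch of the relevant pieces for a MapState-based job, reusing the descriptor and variable names from the question with an example interval (not code from the original answer):
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;

// In the job setup: without this call nothing is snapshotted, so nothing can be restored.
env.enableCheckpointing(10_000L);  // example: checkpoint every 10 seconds

// In the rich function's open() method, on a keyed stream:
MapStateDescriptor<String, String> sourceAggregationDesc =
        new MapStateDescriptor<>("sourceAggregation", Types.STRING, Types.STRING);
MapState<String, String> sourceAggregationMap =
        getRuntimeContext().getMapState(sourceAggregationDesc);
// After a restart from a checkpoint (or from a savepoint taken before cancelling),
// the entries written before the failure are visible here again.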

When are flink checkpoint files cleaned?

I have a streaming job that:
reads from Kafka --> maps events to some other DataStream --> keyBy(0) --> reduces over a 15-second processing-time window and writes back to a Redis sink.
When starting up, everything works great. The problem is that after a while, the disk fills up with what I think are Flink checkpoints.
My question is: are the checkpoints supposed to be cleaned up/deleted while the Flink job is running? I could not find any resources on this.
I'm using a filesystem backend that writes to /tmp (no HDFS setup).
Flink cleans up checkpoint files while it is running. There were some corner cases where it "forgot" to clean up all files in the case of system failures.
For Flink 1.3 the community is working on fixing all of these issues.
In your case, I assume that you don't have enough disk space to store the data of your windows on disk.
Checkpoints are by default not persisted externally and are only used to resume a job from failures. They are deleted when a program is cancelled.
If you are taking externalized checkpoints, there are two cleanup policies:
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the externalized checkpoint when the job is cancelled. Note that you have to manually clean up the checkpoint state after cancellation in this case.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the externalized checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails.
For more details, see https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html
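A minimal sketch of how those cleanup policies are selected, assuming the Flink 1.4-era API referenced above (the interval is only an example):
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(15_000L);  // example: checkpoint every 15 seconds

// Externalize (retain) checkpoints and choose what happens to them on cancellation:
env.getCheckpointConfig().enableExternalizedCheckpoints(
        ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);  // or DELETE_ON_CANCELLATION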
