Flink: what does the method setDbStoragePath do in RocksDBStateBackend?

I'm using flink 1.11 with RocksDBStateBackend, the code looks like this:
RocksDBStateBackend stateBackend = new RocksDBStateBackend("hdfs:///flink-checkpoints", true);
stateBackend.setDbStoragePath("/tmp/rocksdb/");
env.setStateBackend(stateBackend);
My questions are:
My understanding is that when DbStoragePath is set, Flink will put all checkpoints and state on a local disk (in my case /tmp/rocksdb) before storing them in hdfs:///flink-checkpoints. Is that right? And if so, should I always set DbStoragePath for better performance?
Because Flink doesn't delete old checkpoints, I have a job that periodically cleans up old checkpoints. But I'm not sure whether that is safe to do if I enable incremental checkpoints.

The DbStoragePath is the location on the local disk where RocksDB keeps its working state. By default the tmp directory is used. Ideally this should be the fastest available disk, e.g. an SSD. Normally this is configured via state.backend.rocksdb.localdir.
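For illustration, here is a minimal sketch against the Flink 1.11 API (the storage paths are placeholders) of setting the local working directory directly on the backend; this is equivalent to setting state.backend.rocksdb.localdir in flink-conf.yaml.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDbLocalDirExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Durable checkpoint storage; 'true' enables incremental checkpoints.
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink-checkpoints", true);

        // RocksDB's working state lives on local disk, ideally the fastest disk available.
        // Multiple paths can be given to spread the disk I/O across several local disks.
        // This is equivalent to state.backend.rocksdb.localdir in flink-conf.yaml.
        backend.setDbStoragePaths("/mnt/ssd1/rocksdb", "/mnt/ssd2/rocksdb");

        env.setStateBackend(backend);
        env.enableCheckpointing(60_000); // checkpoint every 60 seconds

        // ... define the job and call env.execute() ...
    }
}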
If you are using incremental checkpoints, then the SST files from the DbStoragePath are copied to the state.checkpoints.dir. Otherwise full snapshots are written to the checkpoint directory and the DbStoragePath isn't involved.
Flink automatically deletes old checkpoints, except after canceling a job that is using retained checkpoints. It's not obvious how to safely delete an incremental, retained checkpoint -- you need to somehow know if any of those SST files are still referred to from the latest checkpoint. You might ask for advice on the user mailing list.

Related

Flink - Lazy start with operators working during savepoint startup

I am using Apache Flink with RocksDBStateBackend and going through some trouble when the job is restarted using a savepoint.
Apparently, it takes some time for the state to be ready again, but even before the state is ready, the DataStreams from Kafka seem to be moving data around, which causes some invalid misses in my KeyedProcessFunction.
Is this the expected behavior? I couldn't find anything about it in the documentation, and apparently there is no related configuration.
The ideal for us would be to have the state fully ready to be queried before any data is moved.
For example, during a deployment the estimate_num_keys metric was still slowly increasing.
However, an application counter from one of the operators shows that the operators were already working during that "warm-up phase".
I found some discussion in "Apache flink: Lazy load from save point for RocksDB backend", where it was suggested to use Externalized Checkpoints.
I will look into it, but currently, our state isn't too big (~150 GB), so I am not sure if that is the only path to try.
Starting a Flink job that uses RocksDB from a savepoint is an expensive operation, as all of the state must first be loaded from the savepoint into new RocksDB instances. On the other hand, if you use a retained, incremental checkpoint, then the SST files in that checkpoint can be used directly by RocksDB, leading to much faster start-up times.
But, while it's normal for starting from a savepoint to be expensive, this shouldn't lead to any errors or dropped data.
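To make the retained-incremental-checkpoint option concrete, here is a minimal sketch (Flink 1.11 API, placeholder paths) of enabling it; the job would then be resumed by passing the retained checkpoint path to flink run -s.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedIncrementalCheckpoints {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Incremental checkpoints: only new or changed SST files are uploaded each time.
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink-checkpoints", true));

        env.enableCheckpointing(60_000);

        // Keep the last completed checkpoint after cancellation, so the job can be
        // restarted from it, e.g.: flink run -s hdfs:///flink-checkpoints/<job-id>/chk-<n> ...
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... define the job and call env.execute() ...
    }
}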

Flink: local dir for state.checkpoints.dir

I was trying to understand the implications of using a local directory, e.g. file:///checkpoints/, for state.checkpoints.dir. My confusion is: 1) there might be multiple TaskManagers; does that mean each will save its own checkpoints to its local disk? 2) does this work in an environment like Kubernetes, where Pods might be moved around in the cluster?
This won't work. state.checkpoints.dir must be a URI that is accessible to every machine in the cluster, i.e., some sort of distributed filesystem. This is necessary for recovery in situations in which a task manager has failed, or when state needs to be redistributed for rescaling.
You may also want each TaskManager to additionally keep a copy of its state locally for faster recovery; see Task Local Recovery for info on that option.
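As a rough sketch of that option, and assuming the standard state.backend.local-recovery setting: on a real cluster it is normally enabled in flink-conf.yaml, but for illustration it can also be set programmatically on a local environment.

import org.apache.flink.configuration.CheckpointingOptions;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class LocalRecoveryExample {
    public static void main(String[] args) throws Exception {
        // Equivalent to putting "state.backend.local-recovery: true" into flink-conf.yaml.
        // Each task manager then keeps a secondary local copy of its checkpointed state,
        // so recovery after a failure can read from local disk instead of the remote filesystem.
        Configuration conf = new Configuration();
        conf.setBoolean(CheckpointingOptions.LOCAL_RECOVERY, true);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(4, conf);

        // ... define the job, enable checkpointing, call env.execute() ...
    }
}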

Flink local recovery state file clear

We are testing Flink's local-recovery option to achieve fast recovery for our large keyed state. We canceled our currently running job and then restarted it from the last checkpoint, and we found that the previous state remained in the file system. We want to ask whether the state files will not be deleted even after we have resumed the job. We would not want our local tasks' disk usage to grow without bound.
The state files will not be deleted, because a new job ID is assigned to the resumed job, so Flink creates a new directory to store the checkpoint files. And that totally makes sense to me.
Suppose Flink did delete the state files after recovery: what would you do if the program failed again?

Apache Flink - Difference between Checkpoints & Savepoints?

Can someone please help me understand the difference between Apache Flink's Checkpoints and Savepoints?
I read the documentation, but I still couldn't understand the difference.
Apache Flink's Checkpoints and Savepoints are similar in that they are both mechanisms for preserving the internal state of Flink applications.
Checkpoints are taken automatically and are used to restart a job automatically in case of a failure.
Savepoints, on the other hand, are taken manually, are always stored externally, and are used for starting a "new" job with the previous internal state, e.g. for
bug fixing
flink version upgrade
A/B testing, etc.
Underneath they are in fact the same mechanism/code path with some subtle nuances.
Edit:
You can also find a very good explanation in the official documentation https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint :
A Savepoint is a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism. You can use Savepoints to stop-and-resume, fork, or update your Flink jobs. Savepoints consist of two parts: a directory with (typically large) binary files on stable storage (e.g. HDFS, S3, …) and a (relatively small) meta data file. The files on stable storage represent the net data of the job’s execution state image. The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths.
Attention: In order to allow upgrades between programs and Flink versions, it is important to check out the following section about assigning IDs to your operators.
Conceptually, Flink’s Savepoints are different from Checkpoints in a similar way that backups are different from recovery logs in traditional database systems. The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink, i.e. a Checkpoint is created, owned, and released by Flink - without user interaction. As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are i) being as lightweight to create and ii) being as fast to restore from as possible. Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn’t change between the execution attempts. Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).
In contrast to all this, Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph, changing parallelism, forking a second job like for a red/blue deployment, and so on. Of course, Savepoints must survive job termination. Conceptually, Savepoints can be a bit more expensive to produce and restore and focus more on portability and support for the previously mentioned changes to the job.
Those conceptual differences aside, the current implementations of Checkpoints and Savepoints are basically using the same code and produce the same format. However, there is currently one exception from this, and we might introduce more differences in the future. The exception are incremental checkpoints with the RocksDB state backend. They are using some RocksDB internal format instead of Flink’s native savepoint format. This makes them the first instance of a more lightweight checkpointing mechanism, compared to Savepoints.
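To illustrate the note above about assigning IDs to your operators, here is a minimal sketch (hypothetical topic, ID names, and Kafka connector dependency) of giving each operator a stable uid() so that the state stored in a savepoint can be matched back to the right operator after an upgrade.

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class OperatorIdsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        kafkaProps.setProperty("group.id", "example-group");           // placeholder group

        DataStream<String> events = env
                .addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), kafkaProps))
                .uid("kafka-source");          // the source's offset state is matched by this ID

        events
                .filter(value -> !value.isEmpty())
                .uid("drop-empty-events")      // give every operator (at least every stateful one) an ID
                .print()
                .uid("stdout-sink");

        env.execute("operator-ids-example");
    }
}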
Savepoints
Savepoints usually apply to an individual transaction; a savepoint marks a point to which the transaction can be rolled back, so subsequent changes can be undone if necessary.
See more here:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html#savepoints
Checkpoints
Checkpoints usually apply to whole systems. You can configure periodic checkpoints to be persisted externally. Externalized checkpoints write their metadata to persistent storage and are not automatically cleaned up when the job fails.
See more here:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html
One difference I would like to add: a savepoint is applied manually, e.g. when we upgrade the pipeline, whereas a checkpoint kicks in automatically when the pipeline restarts or crashes abruptly. In the latter case, however, the application (pipeline) may have to handle side effects such as re-processing duplicate data.

When are flink checkpoint files cleaned?

I have a streaming job that:
reads from Kafka --> maps events to some other DataStream --> key by(0) --> reduces a time window of 15 seconds processing time and writes back to a Redis sink.
When starting up, everything works great. The problem is that, after a while, the disk space gets filled up by what I think are Flink's checkpoints.
My question is: are the checkpoints supposed to be cleaned/deleted while the Flink job is running? I could not find any resources on this.
I'm using a filesystem backend that writes to /tmp (no hdfs setup)
Flink cleans up checkpoint files while it is running. There were some corner cases where it "forgot" to clean up all files in case of system failures.
But for Flink 1.3 the community is working on fixing all these issues.
In your case, I'm assuming that you don't have enough disk space to store the data of your windows on disk.
Checkpoints are by default not persisted externally and are only used to resume a job from failures. They are deleted when a program is cancelled.
If you are taking externalized checkpoints, then there are two cleanup policies:
ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION: Retain the externalized checkpoint when the job is cancelled. Note that you have to manually clean up the checkpoint state after cancellation in this case.
ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION: Delete the externalized checkpoint when the job is cancelled. The checkpoint state will only be available if the job fails.
For more details
https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html
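For illustration, a minimal sketch of how those two cleanup policies are selected in code (the checkpoint interval is arbitrary):

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExternalizedCheckpointPolicies {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(30_000); // checkpoint every 30 seconds

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // RETAIN_ON_CANCELLATION: the latest checkpoint survives cancellation;
        // you are responsible for deleting it yourself later.
        checkpointConfig.enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // DELETE_ON_CANCELLATION (alternative): the checkpoint is removed on cancellation
        // and only survives if the job fails.
        // checkpointConfig.enableExternalizedCheckpoints(
        //         CheckpointConfig.ExternalizedCheckpointCleanup.DELETE_ON_CANCELLATION);

        // ... define the job and call env.execute() ...
    }
}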
