RocksDBStateBackend in Flink: how does it work exactly? - apache-flink

I have read the official Flink documentation on state backends. In particular, I am interested in the RocksDBStateBackend.
What I don't understand is this: if I enable this backend, will the TaskManagers access RocksDB through another node in the Flink cluster?
My understanding so far is that with the RocksDBStateBackend the TaskManagers keep the state in their own memory, i.e. the memory of the JVM process. Do they then send that state to RocksDB to be stored? If so, where is RocksDB within the Flink cluster? Where does it physically run?

RocksDB is an embedded database. If you are using RocksDB as your state backend for Flink, then each task manager has a local instance of RocksDB, which runs as a native (JNI) library inside the JVM. When using RocksDB, your state lives as serialized bytes on the local disk, with an in-memory (off-heap) cache.
During checkpointing, the SST files from RocksDB are copied from the local disk to the distributed file system where the checkpoint is stored. If the local recovery option is enabled, then a local copy is retained as well, to speed up recovery. But it wouldn't be safe to rely only on the local copy, as the local disk might be lost if the node fails. This is why checkpoints are always stored on a distributed file system.
The alternative to RocksDB is to use one of the heap-based state backends, in which case your state will live as objects on the JVM heap.
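As a rough illustration of the setup described above, a cluster-wide configuration might look like the following flink-conf.yaml fragment. This is only a sketch: the HDFS URI is a placeholder, and the key names are the ones used by Flink releases that ship the RocksDBStateBackend class, so check them against the documentation for your version.

```yaml
# Keep working state in an embedded RocksDB instance inside each TaskManager.
state.backend: rocksdb

# Durable location for checkpoints (placeholder URI; must be reachable from every node).
state.checkpoints.dir: hdfs://namenode:40010/flink/checkpoints

# Also retain a local copy of each checkpoint to speed up recovery.
state.backend.local-recovery: true
```

Even with local recovery enabled, the copy in state.checkpoints.dir remains the one Flink relies on if a node is lost.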

Related

Is state saved in TaskManager's memory regardless of state back end?

I know I can set the state backend either globally in Flink's configuration file (flink-conf.yaml) or per job:
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
I have one question:
Where does a TaskManager keep its state data while the Flink job is running? I understand that when a checkpoint completes, the checkpointed data is saved in HDFS (chk-XXX) or RocksDB. But while the job keeps running, each TaskManager accumulates more and more state. Is that state always kept in memory?
If it is kept in memory, then the state data can't grow too large, or else an OOM may occur.
Can I use RocksDB inside the TaskManager process to store the TaskManager's state? Thanks!
With the FsStateBackend, the working state for each task manager is in memory (on the JVM heap), and state backups (checkpoints) go to a distributed file system, e.g., HDFS.
With the RocksDBStateBackend, the working state for each task manager is in a local RocksDB instance, i.e., on the local disk, and again, the state backups (checkpoints) go to a distributed file system, e.g., HDFS.
Flink never stores checkpoint data in RocksDB. That's not the role it plays. RocksDB is used as an ephemeral, embedded data store whose contents can be lost in the event that a task manager fails. This is an alternative to keeping the working state in memory (where it can also be lost when a task manager fails).

How to store checkpoints in a remote RocksDB in Apache Flink

I know that there are three kinds of state backends in Apache Flink: MemoryStateBackend, FsStateBackend and RocksDBStateBackend.
As I understand it, MemoryStateBackend stores checkpoints in local RAM, FsStateBackend stores checkpoints in the local file system, and RocksDBStateBackend stores checkpoints in RocksDB. I have some questions about the RocksDBStateBackend.
As I understand it, RocksDB is a key-value database that is embedded into Apache Flink. So, if I'm right, Flink stores all checkpoints in the embedded RocksDB, which uses the local disk.
If so, I think the disk could fill up in some cases because of the checkpoints stored in RocksDB. Is it possible to configure a remote RocksDB to store these checkpoints? If it is, should we worry about the remote RocksDB crashing? If the remote RocksDB crashes, the Flink jobs cannot continue, right?
There is no option to use an external or remote RocksDB with Apache Flink. RocksDB is an embedded key-value store with a local instance in each task manager.
Several points:
Flink makes a strong distinction between the working state, which is always local (for good performance), and state snapshots (checkpoints and savepoints), which are not local (for reliability they should be stored in a distributed file system).
The RocksDBStateBackend uses the local disk for working state. The other two state backends keep their working state on the Java heap.
The checkpoint coordinator arranges for all of these slices of data scattered across all of the task managers to be collected together into complete checkpoints that are stored elsewhere. In the case of the MemoryStateBackend those checkpoints are stored on the JobManager heap; for the other two, they are in a distributed file system.
You want to configure RocksDB to use the fastest available local file system. Try to use locally attached SSDs, and avoid network-attached storage (such as EBS). Do not try to use a distributed file system such as S3 as RocksDB's local storage.
state.backend.rocksdb.localdir controls where each local RocksDB stores its working state.
The parameter to the RocksDBStateBackend constructor controls where the checkpoints are stored. E.g., using S3 as recommended by @ezequiel is the obvious choice on AWS.
The RocksDBStateBackend can write its checkpoints to any file system supported by Flink:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/
If you are running Flink, you probably want to take checkpoints and resume from them.
I would keep that storage outside the node. If you are using a cloud provider like AWS, then S3 is the right option.
So you should probably write something like:
new RocksDBStateBackend("s3://my-bucket", true); and assign it to your execution environment. (The second argument enables incremental checkpointing.)
Please check the documentation linked above to configure your file system properly.

Flink - What is localdir configuration in RocksDB?

I'm new to Flink and I'm confused about the state backend configuration.
As far as I know, RocksDB saves all of the application's state on the file system.
I use S3 to store the state, so I pointed both state.checkpoints.dir and state.savepoints.dir at my S3 bucket.
Now I see that there is another option related to RocksDB storage, called state.backend.rocksdb.localdir. What is its purpose? (I saw that I can't use S3 for it.)
Also, if RocksDB uses local machine storage for something, what happens when I run on Kubernetes and my pod suddenly fails? Should I use persistent storage?
One more thing: I'm not sure I have understood state correctly.
Does a checkpoint save all of my state? For example, if I use an AggregationFunction and the application fails, is the aggregated value for each key restored when the application recovers?
Each of Flink's state backends keeps its working state somewhere local to each worker, while persisting the checkpoints somewhere durable, such as S3. With the heap-based state backend, the working state is stored as objects on the JVM heap, while with RocksDB the working state is stored as serialized bytes on the local disk (with an in-memory, off-heap cache). For performance reasons you don't want to use S3 (or even network-attached storage) for state.backend.rocksdb.localdir. Use local SSD storage if you can.
Flink doesn't rely on the local RocksDB storage surviving failures, just as it doesn't expect state on the heap to survive a failure, so you can safely use ephemeral storage for state.backend.rocksdb.localdir. When the state does need to be recovered, the latest checkpoint is sufficient. (But the copy on the local disk can be used as an optimization, avoiding the need to read from the DFS: see the docs on state.backend.local-recovery for details.)
If your application fails, the aggregated value for each key in an AggregationFunction will be restored during recovery. The checkpoints include everything, including state kept by the sources and sinks, windows, timers, ProcessFunctions, RichFunctions, etc.
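To make the split between durable and local storage concrete, here is a minimal flink-conf.yaml sketch. The bucket name and the local path are hypothetical placeholders, not values taken from the question, and the S3 file system itself still has to be set up as described in Flink's file system documentation.

```yaml
state.backend: rocksdb

# Durable snapshots: checkpoints and savepoints go to S3 (hypothetical bucket).
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.savepoints.dir: s3://my-bucket/flink/savepoints

# Working state: a fast local disk on each TaskManager (hypothetical path).
# Ephemeral storage is fine here, since recovery reads the latest checkpoint.
state.backend.rocksdb.localdir: /mnt/local-ssd/flink/rocksdb
```

On Kubernetes this means the local directory can be backed by ephemeral pod storage, while the S3 locations are what a restarted pod recovers from.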

Can I use the Flink RocksDB state backend with a local file system?

I am exploring the Flink RocksDB state backend. The documentation seems to imply that I can use a regular file system path such as file:///data/flink/checkpoints, but the Javadoc in the code only mentions the HDFS and S3 options.
I am wondering whether it is possible to use a local file system with the Flink RocksDB backend. Thanks!
Flink docs: https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/state_backends.html#the-rocksdbstatebackend
Flink code: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L175
No, you should not do that!
With this path you configure the directory into which Flink writes checkpoints. A checkpoint is a copy of your application state that is used to restore the application after a failure, such as a machine failure. The path must point to persistent, remote storage so that the checkpoint can still be read if a process was killed or a machine died. If a checkpoint was written to the local file system of a machine that failed, you would not be able to recover the job and restore the state.
However, you can write the checkpoint to a local path if this is a mount point of an NFS (or any other remote storage) that can be mounted from other machines as well.

Apache Flink: Why choose the MemoryStateBackend over the FsStateBackend?

Flink has a MemoryStateBackend and a FsStateBackend (and a RocksDBStateBackend). Both seem to extend the HeapKeyedStateBackend, i.e. the mechanism for storing the current working state is entirely the same.
This SO answer says that the main difference lies in the MemoryStateBackend keeping a copy of the checkpoints in the JobManager's memory. (I wasn't able to glean any evidence for that from the source code.)
The MemoryStateBackend also limits the maximum state size per subtask.
Now I wonder: Why would you ever want to use the MemoryStateBackend?
As you said, both MemoryStateBackend and FsStateBackend are based on HeapKeyedStateBackend. This means that both state backends maintain the state of an operator as regular objects on the JVM heap of the TaskManager, i.e., state is always accessed in memory.
The backends differ in how they persist the state for checkpoints. A checkpoint is a copy of the state of all operators of an application that is stored somewhere. In case of a failure, the application is restarted and the state of the operators is initialized from the checkpoint.
The FsStateBackend stores the checkpoint in a file system, typically HDFS, S3, or an NFS that is mounted on all worker nodes. The MemoryStateBackend stores the checkpoint in the JVM of the JobManager. This has the following pros and cons:
Pros:
No need to set up a (distributed) file system.
No need to configure a storage location.
Cons:
State is lost if the JobManager process dies.
Size of state is bound by the size of the JobManager memory.
Since checkpoints are lost if the JM goes down, the MemoryStateBackend is unsuitable for most production use cases. It can be useful for developing and testing stateful applications, because it requires no configuration or setup.

Resources