Flink: local dir for state.checkpoints.dir - apache-flink

I was trying to understand the implications of using local dir e.g. file:///checkpoints/ for state.checkpoints.dir. My confusion is that 1) there might be multiple TaskManagers, does that mean each will save its own checkpoints to its local disk? 2) does this work in the environment like Kubernetes? because Pods might be moved around in the cluster.

This won't work. state.checkpoints.dir must be a URI that is accessible to every machine in the cluster, i.e., some sort of distributed filesystem. This is necessary for recovery in situations in which a task manager has failed, or when state needs to be redistributed for rescaling.
You may also want each TaskManager to additionally keep a copy of its state locally for faster recovery; see Task Local Recovery for info on that option.

Related

K8s which access mode should be used in a persistentvolumeclaim for a database deployment?

I want to store the data from a PostgreSQL database in a persistentvolumeclaim.
(on a managed Kubernetes cluster on Microsoft Azure)
And I am not sure which access mode to choose.
Looking at the available access modes:
ReadWriteOnce
ReadOnlyMany
ReadWriteMany
ReadWriteOncePod
I would say, I should choose either ReadWriteOnce or ReadWriteMany.
Thinking about the fact that I might want to migrate the database pod to another node pool at some point, I would intuitively choose ReadWriteMany.
Is there any disadvantage if I choose ReadWriteMany instead of ReadWriteOnce?
You are correct with the migration, where the access mode should be set to ReadWriteMany.
Generally, if you have access mode ReadWriteOnce and a multinode cluster on microsoft azure, where multiple pods need to access the database, then the kubernetes will enforce all the pods to be scheduled on the node that mounts the volume first. Your node can be overloaded with pods. Now, if you have a DaemonSet where one pod is scheduled on each node, this could pose a problem. In this scenario, you are best with tagging the PVC and PV with access mode ReadWriteMany.
Therefore
if you want multiple pods to be scheduled on multiple nodes and have access to DB, for write and read permissions, use access mode ReadWriteMany
if you logically need to have pods/db on one node and know for sure, that you will keep the logic on the one node, use access mode ReadWriteOnce
You should choose ReadWriteOnce.
I'm a little more familiar with AWS, so I'll use it as a motivating example. In AWS, the easiest kind of persistent volume to get is backed by an Amazon Elastic Block Storage (EBS) volume. This can be attached to only one node at a time, which is the ReadWriteOnce semantics; but, if nothing is currently using the volume, it can be detached and reattached to another node, and the cluster knows how to do this.
Meanwhile, in the case of a PostgreSQL database storage (and most other database storage), only one process can be using the physical storage at a time, on one node or several. In the best case a second copy of the database pointing at the same storage will fail to start up; in the worst case you'll corrupt the data.
So:
It never makes sense to have the volume attached to more than one pod at a time
So it never makes sense to have the volume attached to more than one node at a time
And ReadWriteOnce volumes are very easy to come by, but ReadWriteMany may not be available by default
This logic probably applies to most use cases, particularly in a cloud environment, where you'll also have your cloud provider's native storage system available (AWS S3 buckets, for example). Sharing files between processes is fraught with peril, especially across multiple nodes. I'd almost always pick ReadWriteOnce absent a really specific need to use something else.

Flink, what does method setDbStoragePath do in RocksDBStateBackend?

I'm using flink 1.11 with RocksDBStateBackend, the code looks like this:
RocksDBStateBackend stateBackend = new RocksDBStateBackend("hdfs:///flink-checkpoints", true);
stateBackend.setDbStoragePath(config.getString("/tmp/rocksdb/"));
env.setStateBackend(stateBackend);
My questions are:
My understanding is that when DbStoragePath is set, Flink will put all checkpoints and state in a local disk (in my case /tmp/rocksdb) before storing into hadoop hdfs:///flink-checkpoints. Is that right? And if it's right, should I always set DbStoragePath for better performance?
Because Flink doesn't delete old checkpoints, I have a job periodically clean up old checkpoints. But I'm not sure is it safe to do that if I set incremental checkpoints?
The DbStoragePath is the location on the local disk where RocksDB keeps its working state. By default the tmp directory will be used. Ideally this should be fastest available disk -- e.g., SSD. Normally this is configured via state.backend.rocksdb.localdir.
If you are using incremental checkpoints, then the SST files from the DbStoragePath are copied to the state.checkpoints.dir. Otherwise full snapshots are written to the checkpoint directory and the DbStoragePath isn't involved.
Flink automatically deletes old checkpoints, except after canceling a job that is using retained checkpoints. It's not obvious how to safely delete an incremental, retained checkpoint -- you need to somehow know if any of those SST files are still referred to from the latest checkpoint. You might ask for advice on the user mailing list.

Flink production session cluster in EKS instance failure and recovery

I am newbie in Flink, planning to deploy Flink session cluster on EKS with 1 job manager and 5 task managers (each task managers with 4 slots). Different jobs will be submitted through UI for different usecase.
Let's say I have submitted a stateful job (job has simple counter logic using RichFlatMapFunction) backed by RocksDBStateBackend with S3 checkpointDataUri and DbStoragePath pointed to local file path and this job utilises 8 slot totally which is spreaded across two task managers and running fine without any issues for a day. Now following are my question,
1) My understanding about checkpointDataUri and DbStoragePath in RocksDBStateBackend is, checkpointDataUri stores the processed offset information in S3 (since I configured the checkpointDataUri with S3 prefix) and DbStoragePath contains all the state information which is used in RichFlatMapFunction. So all the stateful information are stored in checkpointDataUri which is available in local only. Please correct me If it is wrong.
2) Lets say my Ec2 instance was restarted (the one where the 4 slots was utilised) for some reason and it took around 30 minutes to come online, in this case, EKS will make the new Ec2 instance as TaskManager to match the replicas, however whether Flink job manager will try to reschedule the 4 slots to a different task manager now? If yes, how the state which was stored in Ec2 local instance has to be recovered?
3) Is there is any document/video for Flink EKS failure recovery related things. I saw the official documentation which specifies how to deploy Flink session cluster in EKS. But I don't find anything related to failure recovery in EKS mode. Could someone please point me in the right direction on this?
All of the state you are concerned about, namely the processed offsets and the state used in the RichFlatMapFunction (and any other state Flink is managing for your job) is stored both on the local disk (DbStoragePath) and in S3 (checkpointDataUri).
Flink always keeps a working copy of all of the state local to each task manager (for high throughput and low latency), and in the background makes complete copies of this state to a distributed file system (like S3) for reliability.
In other words, what you said in point (1) of your question was incorrect. And the answer to point (2) is that the state to be recovered can always be recovered from S3 if it's not available locally. As for point (3), there's nothing special about failure recovery on EKS compared to any other Flink deployment model.

How to store checkpoint into remote RocksDB in Apache Flink

I know that there are three kinds of state backends in Apache Flink: MemoryStateBackend, FsStateBackend and RocksDBStateBackend.
MemoryStateBackend stores the checkpoints into local RAM, FsStateBackend stores the checkpoints into local FileSystem, and RocksDBStateBackend stores the checkpoints into RocksDB. I have some questions about the RocksDBStateBackend.
As my understanding, the mechanism of RocksDBStateBackend has been embedded into Apache Flink. The rocksDB is a kind of key-value DB. So If I'm right, it means that Flink will store all checkpoints into the embedded rocksDB, which uses the local disk.
If so, I think the disk could be exhausted in some cases because of the checkpoints stored into the rocksDB. Now I'm thinking if it is possible to configure a remote rocksDB to store these checkpoints? If it is possible, should we worry about the remote rocksDB crashing? If the remote rocksDB crashes, the jobs of Flink can not continue working, right?
There is no option to use an external or remote RocksDB with Apache Flink. RocksDB is an embedded key-value store with a local instance in each task manager.
Several points:
Flink makes a strong distinction between the working state, which is always local (for good performance), and state snapshots (checkpoints and savepoints), which are not local (for reliability they should be stored in a distributed file system).
The RocksDBStateBackend uses the local disk for working state. The other two state backends keep their working state on the Java heap.
The checkpoint coordinator arranges for all of these slices of data scattered across all of the task managers to be collected together into complete checkpoints that are stored elsewhere. In the case of the MemoryStateBackend those checkpoints are stored on the JobManager heap; for the other two, they are in a distributed file system.
You want to configure RocksDB to use the fastest available local file system. Try to use locally attached SSDs, and avoid network-attached storage (such as EBS). Do not try to use a distributed file system such as S3 as RocksDB's local storage.
state.backend.rocksdb.localdir controls where each local RocksDB stores its working state.
The parameter to the RocksDBStateBackend constructor controls where the checkpoints are stored. E.g., using S3 as recommended by #ezequiel is the obvious choice on AWS.
RocksDB can work with any supported Filesystem by Flink
https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/
If you are running Flink probably you want to checkpoint, and resume from them.
I would externalise the storage outside the node. I you are using a cloud provider like AWS, then S3 is the right option.
So you should probably write something like:
new RocksDBStateBackend("s3://my-bucket", true); and assing it to your execution environment.
Please check the above documentation to configure properly your filesystem.

What would happen if I configured a local file system for Flink checkpointing?

I have saw a video named Managing State in Apache Flink - Tzu-Li (Gordon) Tai.
In this video, it stores data with distributed file system.
I'm wondering that what would happen if I configured a local file system for Flink checkpointing?
eg:
env.setStateBackend(new RocksDBStateBackend(getString("flie:///tmp/checkpoints"), true));
I assume that every node of Flink cluster will keep their own data. Would it work well?
I assume that every node of Flink cluster will keep their own data.
That is correct.
Would it work well?
With a local file system and distributed nodes you would may be able to checkpoint just fine (even that is not certain, as the directory may be getting created by the JobManager so the TaskManager instances potentially will fail with the directory not existing) however you would not be able to restore, as the JobManager reads that and distributes that out to the operators as needed.
Strictly speaking, it does not matter if the file system is local or distributed to flink. What is important is that the JobManager as restore time is able to see all of the checkpoint data. If you are running with everything on the same machine, then a local file system would work just fine.
I think in principle you could even have all nodes write locally and then manually use a synchronization process to move the data to somewhere that the JobManager could see it during an attempted restore, however that is certainly not a recommended approach.

Resources