Is state saved in TaskManager's memory regardless of state back end? - apache-flink

I know I can set the state backend both in the flink's configuration file(flink-conf.yaml) globally
or set in the per-job scope.
val env = StreamExecutionEnvironment.getExecutionEnvironment()
env.setStateBackend(new FsStateBackend("hdfs://namenode:40010/flink/checkpoints"))
I have one question here:
Where are the state data that belongs to the TasManager saved in TaskManager while the flink job keeps running? I mean that when one checkpoint is done,the checkpointed data will be saved in HDFS(chk-XXX) or RocksDB, but while the flink job keeps running, the taskManager will accumulate more and more states belonging to this task manager, are they always saved in memory?
If they are kept in memory, then the checkpoint data can't be too large,or else OOM may occur.
Can I use RocksDB in TaskManager process to save the TM's states data? Thanks!

With the FsStateBackend, the working state for each task manager is in memory (on the JVM heap), and state backups (checkpoints) go to a distributed file system, e.g., HDFS.
With the RocksDBStateBackend, the working state for each task manager is in a local RocksDB instance, i.e., on the local disk, and again, the state backups (checkpoints) go to a distributed file system, e.g., HDFS.
Flink never stores checkpoint data in RocksDB. That's not the role it plays. RocksDB is used as an ephemeral, embedded data store whose contents can be lost in the event that a task manager fails. This is an alternative to keeping the working state in memory (where it can also be lost when a task manager fails).

Related

Flink-RocksDB behaviour after task manager failure

I am experimenting with my new Flink cluster(3 Different Machines-> 1 Job Manager, 2-> Task Managers) using RocksDB as State Backend however the checkpointing behaviour I am getting is a little confusing.
More specifically, I have designed a simple WordCount example and my data source is netcat. When I submit my job, the job manager assigns it to a random task manager(no replication as well). I provide some words and then I kill the currenlty running task manager. After a while, the job restarts in the other task manager and I can provide some new words. The confusing part is that state from the first task manager is preserved even when I have killed it.
To my understanding, RocksDB maintains its state in a local directory of the running task manager, so what I expected was when the first task manager was killed to lose the entire state and start counting words from the beginning. So Flink seems to somehow maintain its state in the memory(?) or broadcasts it through JobManager?
Am I missing something?
The RocksDB state backend does keep its working state on each task manager's local disk, while checkpoints are normally stored in a distributed filesystem.
If you have checkpointing enabled, then the spare task manager is able to recover the state from the latest checkpoint and resume processing.

How to store checkpoint into remote RocksDB in Apache Flink

I know that there are three kinds of state backends in Apache Flink: MemoryStateBackend, FsStateBackend and RocksDBStateBackend.
MemoryStateBackend stores the checkpoints into local RAM, FsStateBackend stores the checkpoints into local FileSystem, and RocksDBStateBackend stores the checkpoints into RocksDB. I have some questions about the RocksDBStateBackend.
As my understanding, the mechanism of RocksDBStateBackend has been embedded into Apache Flink. The rocksDB is a kind of key-value DB. So If I'm right, it means that Flink will store all checkpoints into the embedded rocksDB, which uses the local disk.
If so, I think the disk could be exhausted in some cases because of the checkpoints stored into the rocksDB. Now I'm thinking if it is possible to configure a remote rocksDB to store these checkpoints? If it is possible, should we worry about the remote rocksDB crashing? If the remote rocksDB crashes, the jobs of Flink can not continue working, right?
There is no option to use an external or remote RocksDB with Apache Flink. RocksDB is an embedded key-value store with a local instance in each task manager.
Several points:
Flink makes a strong distinction between the working state, which is always local (for good performance), and state snapshots (checkpoints and savepoints), which are not local (for reliability they should be stored in a distributed file system).
The RocksDBStateBackend uses the local disk for working state. The other two state backends keep their working state on the Java heap.
The checkpoint coordinator arranges for all of these slices of data scattered across all of the task managers to be collected together into complete checkpoints that are stored elsewhere. In the case of the MemoryStateBackend those checkpoints are stored on the JobManager heap; for the other two, they are in a distributed file system.
You want to configure RocksDB to use the fastest available local file system. Try to use locally attached SSDs, and avoid network-attached storage (such as EBS). Do not try to use a distributed file system such as S3 as RocksDB's local storage.
state.backend.rocksdb.localdir controls where each local RocksDB stores its working state.
The parameter to the RocksDBStateBackend constructor controls where the checkpoints are stored. E.g., using S3 as recommended by #ezequiel is the obvious choice on AWS.
RocksDB can work with any supported Filesystem by Flink
https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/
If you are running Flink probably you want to checkpoint, and resume from them.
I would externalise the storage outside the node. I you are using a cloud provider like AWS, then S3 is the right option.
So you should probably write something like:
new RocksDBStateBackend("s3://my-bucket", true); and assing it to your execution environment.
Please check the above documentation to configure properly your filesystem.

RocksDBStateBackend in Flink: how does it works exactly?

I have read the official Flink's documentation about the State Backends, here. In particular, I was interested in the RocksDBStateBackend.
I don't understand, if I enable this kind of backend, RocksDB will be accessible from TaskManagers through another node inside the Flink's cluster?
What I have understood so far about the RocksDBStateBackend is that Task Managers will store the states inside their memory, i.e. the memory of the JVM process. After that, will they send the states to store inside RocksDB? If yes, where is RocksDB inside the Flink's cluster? Where is it phisically?
RocksDB is an embedded database. If you are using RocksDB as your state backend for Flink, then each task manager has a local instance of RocksDB, which runs as a native (JNI) library inside the JVM. When using RocksDB, your state lives as serialized bytes on the local disk, with an in-memory (off-heap) cache.
During checkpointing, the SST files from RocksDB are copied from the local disk to the distributed file system where the checkpoint is stored. If the local recovery option is enabled, then a local copy is retained as well, to speed up recovery. But it wouldn't be safe to rely only on the local copy, as the local disk might be lost if the node fails. This is why checkpoints are always stored on a distributed file system.
The alternative to RocksDB is to use one of the heap-based state backends, in which case your state will live as objects on the JVM heap.

Clarification for State Backend

I have been reading Flink docs and I needed few clarification. Hopefully someone can help me out here.
State Backend - This basically refers to the location where the data for my operations will be stored, for example if I'm doing an aggregation on a 2 hr window, where will this data buffered will be stored. As pointed out in the docs, for a large state we should use RocksDB.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does in-flight data here refers to the incoming data say from kafka stream that has not been yet checkpointed ?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDb when a checkpoint is created, the entire buffered data is stored in stored in disk. Then say when the window is to be triggered at end of 2 hr, this state which was stored in disk will be de-serialised and used for the operation ?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on potentially high throughout stream with very limited resources. Suppose my Kafka Stream has a rate of 50k messages /sec, then I could run it on a single core on my EMR cluster and the tradeoff will be that Flink won't be able to catch up with the incoming rate and have a lag but given enough disk space it won't have a OOM error ?
When a checkpoint is completed, I assume that the completed aggregated checkpoint metadata (like the HDFS or S3 path from each TM) from all the TM will be sent to the JM ?. In case of TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for JM in flink-conf.yaml - jobmanager.heap.size: 1024m.
My confusion here is why does JM needs 1Gb of heap memory. What all does a JM handles apart from synchronisation among TMs. How do I actually decide how much of memory should be configured for JM on production.
Can someone verify that my understanding is correct or not and point me in the correct direction. Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that only is concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.

Restore MapState after Job restart/cancellation

I have to Aggregate the count/sum on event stream for various entities.
event logs(json str) are received from kafka and populate map entityname as key and value is count of the selective attriibutes as json str .
MapState sourceAggregationMap = getRuntimeContext().getMapState(sourceAggregationDesc);
for each event stream repopulate the value .
problem is whenever job gets stopped (failed)/cancelled and when the job gets restarted map state is not getting reinitialized / restored . again count starts from 0.
using Apache flink 1.6.0
state.backend: rocksdb
Checkpoints are used for automatic recovery from failures, and need to be explicitly enabled and configured. Savepoints are triggered manually and are used for restarts and upgrades. Both rely on the same snapshotting mechanism which is described in detail here.
These snapshots capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
With the RocksDB state backend, the working state is held on the local disk (in a location you configure), and checkpoints are durably persisted to a distributed file system (again, configurable). When a job is cancelled, the checkpoints are normally deleted (as they will no longer be needed for recovery), but they can be configured to be retained. If your jobs aren't recovering their state after failures, perhaps the checkpoints are failing, or the job is failing before the first checkpoint can complete. The web ui has a section that displays information about checkpoints, and the logs should also have helpful information.
Update: see also this answer.

Resources