Flink RocksDB doesn't seem to write to localdir - apache-flink

I have a Flink environment with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.fs.memory-threshold: 1024
state.backend.fs.write-buffer-size: 4096
state.backend.incremental: true
state.backend.local-recovery: false
state.checkpoints.num-retained: 1
state.backend.rocksdb.localdir: /flink_data/rocksdb
state.backend.rocksdb.memory.managed: true
flink_data is a PVC that we claim in the OpenShift pod.
We run a job with large state and windows of a couple of hours. From what I understand, RocksDB is better suited to jobs with large state because it can flush some of the data from memory to local storage, as opposed to the default state backend, which works only in memory.
With the configuration above, we expect the job that uses RocksDB to flush its data to the local directory.
When we checked the PVC of the pod, I saw that the RocksDB files were only 280 KB in size.
On the other hand, the task manager has 19 GB of total process memory, of which 11.7 GB goes to Flink managed memory (which, according to the metrics in the Flink UI, always seems to be at 100% usage).
In addition, the RocksDB metrics show that there are lots of keys in the DB.
All the files that have been created are exactly 4 KB each.
Thanks in advance for the help.

Related

When does the Flink flush the data into disk using Rocksdb?

I am using state to process data in Flink.
I used RocksDB because the data to be stored is relatively large compared to the memory size.
I set up the RocksDB configuration and ran my Flink app for several hours.
I expected the job to run normally without any errors, but I found that the job manager stopped receiving heartbeats from the task manager.
When I monitored the memory metrics, the off-heap memory in the task manager kept growing and the job died.
I know that RocksDB is the best choice for storing large objects in stream processing, but in my case that did not work out.
For these reasons, I would like to know precisely when Flink flushes data to disk in RocksDB. I would also appreciate any guidance on configuring RocksDB.
In my case, I configured the settings below for RocksDB; everything else is at its default value.
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.memory.fixed-per-slot: 924m
state.backend.rocksdb.memory.managed: false
state.checkpoints.dir: s3://nxflow-bucket/checkpoints
state.checkpoints.num-retained: 1
state.savepoints.dir: s3://nxflow-bucket/savepoints
You can see that the off-heap memory is growing, as shown in the figure below.
Is there any setting that controls when RocksDB flushes to disk?
Also, is my job really using the disk for the state backend? If so, why did my job die?
Thanks.

Does the Flink RocksDB state backend help with restoring state?

I'm considering using RocksDB as the state backend of a Flink job which has a state size of up to 1 TB.
My environment
checkpoint dir: hdfs
flink job submit: yarn-per-job (per-job mode on yarn cluster)
If the job fails, the retry attempts exceed the maximum retry count, and the job dies completely (or I cancel the job), I think the checkpoint and the RocksDB files will be deleted (because I'm deploying the job in per-job mode, so the task manager would also terminate).
In that case, I think I lose all state and have no way to restore it, but I expected RocksDB to help with restoring the state because it is a disk-based state backend. If not, what is the advantage of using the RocksDB state backend?
Would retaining the checkpoint on cancellation and restarting the job from the checkpoint (or savepoint) help in this case?
Thank you
I would recommend checking out https://nightlies.apache.org/flink/flink-docs-master/docs/ops/production_ready/ for an overview of steps to consider before putting a Flink application in production. Choosing the right state backend is one of them.
What is important for state recovery is that you enable the snapshotting mechanism. That can be either checkpoints or savepoints, which you use with the configured state backend (like RocksDB). When configured properly, your state will be snapshotted to durable storage, so you can recover from it in case of failures. RocksDB is commonly used for large state sizes that no longer fit into memory.

Having consumer issues when RocksDB in flink

I have a job which consumes from RabbitMQ. I was using the FS state backend, but the state seemed to be growing, so I decided to move my state to RocksDB.
The issue is that the job runs fine for the first hours, even later on when traffic slows down, but when the traffic gets high again the consumer starts to have issues (events pile up as unacked), and these issues then show up in the rest of the app.
I have:
4 CPU core
Local disk
16GB RAM
Unix environment
Flink 1.11
Scala version 2.11
1 single job running with few keyedStreams, and around 10 transformations, and sink to Postgres
some configurations
flink.buffer_timeout=50
flink.maxparallelism=4
flink.memory=16
flink.cpu.cores=4
#checkpoints
flink.checkpointing_compression=true
flink.checkpointing_min_pause=30000
flink.checkpointing_timeout=120000
flink.checkpointing_enabled=true
flink.checkpointing_time=60000
flink.max_current_checkpoint=1
#RocksDB configuration
state.backend.rocksdb.localdir=home/username/checkpoints (this is not working don't know why)
state.backend.rocksdb.thread.numfactory=4
state.backend.rocksdb.block.blocksize=16kb
state.backend.rocksdb.block.cache-size=512mb
#rocksdb or heap
state.backend.rocksdb.timer-service.factory=heap (I have tested with rocksdb too and it is the same)
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED
Let me know if more information is needed.
state.backend.rocksdb.localdir should be an absolute path, not a relative one. And this setting isn't for specifying where checkpoints go (which shouldn't be on the local disk), this setting is for specifying where the working state is kept (which should be on the local disk).
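A quick way to catch this kind of mistake before deploying is to check the configured value with plain Java. The sketch below is only an illustration and not part of Flink: the class name LocalDirCheck and the method isAbsoluteLocalDir are hypothetical, and it assumes a Unix-like file system, as in the question.

```java
import java.nio.file.Paths;

public class LocalDirCheck {

    // Returns true when the configured value is an absolute path on this platform.
    public static boolean isAbsoluteLocalDir(String configuredValue) {
        return Paths.get(configuredValue).isAbsolute();
    }

    public static void main(String[] args) {
        // Value from the question: relative, because the leading "/" is missing.
        System.out.println(isAbsoluteLocalDir("home/username/checkpoints"));
        // The same path with the leading "/" added.
        System.out.println(isAbsoluteLocalDir("/home/username/checkpoints"));
    }
}
```

On a Unix-like system the first call reports a relative path and the second an absolute one.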
Your job is experiencing backpressure, meaning that some part of the pipeline can't keep up. The most common causes of backpressure are (1) sinks that can't keep up, and (2) inadequate resources (e.g., the parallelism is too low).
You can test if postgres is the problem by running the job with a discarding sink.
Looking at various metrics should give you an idea of what resources might be under-provisioned.

How to store checkpoint into remote RocksDB in Apache Flink

I know that there are three kinds of state backends in Apache Flink: MemoryStateBackend, FsStateBackend and RocksDBStateBackend.
MemoryStateBackend stores the checkpoints in local RAM, FsStateBackend stores the checkpoints in the local file system, and RocksDBStateBackend stores the checkpoints in RocksDB. I have some questions about the RocksDBStateBackend.
As I understand it, the RocksDBStateBackend mechanism is embedded in Apache Flink. RocksDB is a kind of key-value DB. So if I'm right, Flink will store all checkpoints in the embedded RocksDB, which uses the local disk.
If so, I think the disk could be exhausted in some cases because of the checkpoints stored in RocksDB. Now I'm wondering whether it is possible to configure a remote RocksDB to store these checkpoints. If it is possible, should we worry about the remote RocksDB crashing? If the remote RocksDB crashes, the Flink jobs cannot continue working, right?
There is no option to use an external or remote RocksDB with Apache Flink. RocksDB is an embedded key-value store with a local instance in each task manager.
Several points:
Flink makes a strong distinction between the working state, which is always local (for good performance), and state snapshots (checkpoints and savepoints), which are not local (for reliability they should be stored in a distributed file system).
The RocksDBStateBackend uses the local disk for working state. The other two state backends keep their working state on the Java heap.
The checkpoint coordinator arranges for all of these slices of data scattered across all of the task managers to be collected together into complete checkpoints that are stored elsewhere. In the case of the MemoryStateBackend those checkpoints are stored on the JobManager heap; for the other two, they are in a distributed file system.
You want to configure RocksDB to use the fastest available local file system. Try to use locally attached SSDs, and avoid network-attached storage (such as EBS). Do not try to use a distributed file system such as S3 as RocksDB's local storage.
state.backend.rocksdb.localdir controls where each local RocksDB stores its working state.
The parameter to the RocksDBStateBackend constructor controls where the checkpoints are stored. E.g., using S3 as recommended by @ezequiel is the obvious choice on AWS.
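The split between local working state and remote snapshots described in these points can be expressed as a small sanity check in plain Java. This is a sketch under assumptions, not Flink code: StateLocationCheck and its methods are hypothetical names, and treating only the s3 and hdfs schemes as durable is a simplification chosen because those are the schemes that come up in these threads.

```java
import java.net.URI;
import java.util.Set;

public class StateLocationCheck {

    // Assumption: only these schemes count as durable, distributed storage.
    private static final Set<String> DURABLE_SCHEMES = Set.of("s3", "hdfs");

    // Working state belongs on a local path: no scheme, or the file scheme.
    public static boolean looksLocal(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme == null || scheme.equals("file");
    }

    // Checkpoints belong on a distributed file system.
    public static boolean looksDurable(String location) {
        String scheme = URI.create(location).getScheme();
        return scheme != null && DURABLE_SCHEMES.contains(scheme);
    }

    public static void main(String[] args) {
        // A local directory for working state.
        System.out.println(looksLocal("/flink_data/rocksdb"));
        // A bucket for checkpoints.
        System.out.println(looksDurable("s3://my-bucket"));
        // S3 used as a local directory, which the points above advise against.
        System.out.println(looksLocal("s3://my-bucket"));
    }
}
```

A check like this could run against the values of state.backend.rocksdb.localdir and the checkpoint location before a job is submitted.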
RocksDB can work with any file system supported by Flink:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/filesystems/
If you are running Flink, you probably want to take checkpoints and resume from them.
I would externalise the storage outside the node. If you are using a cloud provider like AWS, then S3 is the right option.
So you should probably write something like:
new RocksDBStateBackend("s3://my-bucket", true); and assign it to your execution environment.
Please check the above documentation to configure your file system properly.

RocksDBStateBackend in Flink: how does it work exactly?

I have read the official Flink documentation about state backends. In particular, I was interested in the RocksDBStateBackend.
I don't understand: if I enable this kind of backend, will RocksDB be accessible from the TaskManagers through another node inside the Flink cluster?
What I have understood so far about the RocksDBStateBackend is that TaskManagers store state in their memory, i.e. the memory of the JVM process. After that, do they send the state to be stored in RocksDB? If so, where is RocksDB inside the Flink cluster? Where is it physically?
RocksDB is an embedded database. If you are using RocksDB as your state backend for Flink, then each task manager has a local instance of RocksDB, which runs as a native (JNI) library inside the JVM. When using RocksDB, your state lives as serialized bytes on the local disk, with an in-memory (off-heap) cache.
During checkpointing, the SST files from RocksDB are copied from the local disk to the distributed file system where the checkpoint is stored. If the local recovery option is enabled, then a local copy is retained as well, to speed up recovery. But it wouldn't be safe to rely only on the local copy, as the local disk might be lost if the node fails. This is why checkpoints are always stored on a distributed file system.
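The copy step can be pictured with a toy model in plain Java. This is not how Flink implements checkpointing; it is a sketch in which temporary directories stand in for the local disk and the distributed file system, and SstCopySketch and copyNewSstFiles are hypothetical names. Skipping files that are already present relies on SST files being immutable once written, which is roughly the idea behind incremental checkpoints (state.backend.incremental: true).

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SstCopySketch {

    // Copies every *.sst file from localDir into snapshotDir, skipping files that
    // are already there. Returns the number of files copied in this call.
    public static int copyNewSstFiles(Path localDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        int copied = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(localDir, "*.sst")) {
            for (Path file : files) {
                Path target = snapshotDir.resolve(file.getFileName());
                if (Files.notExists(target)) {
                    Files.copy(file, target);
                    copied++;
                }
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        // Temporary directories stand in for the local disk and the checkpoint storage.
        Path localDir = Files.createTempDirectory("rocksdb-local");
        Path snapshotDir = Files.createTempDirectory("checkpoint-store");

        // First "checkpoint": one SST file exists locally.
        Files.write(localDir.resolve("000001.sst"), new byte[] {1, 2, 3});
        System.out.println(copyNewSstFiles(localDir, snapshotDir));

        // Second "checkpoint": only the newly written file needs to be copied.
        Files.write(localDir.resolve("000002.sst"), new byte[] {4, 5, 6});
        System.out.println(copyNewSstFiles(localDir, snapshotDir));
    }
}
```

In this model each call copies only the file written since the previous call, while the snapshot directory ends up holding both files.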
The alternative to RocksDB is to use one of the heap-based state backends, in which case your state will live as objects on the JVM heap.

Resources