When does Flink flush data to disk when using RocksDB? - apache-flink

I am using state to process data with Flink.
I chose RocksDB because the data to be stored is relatively large compared to the available memory.
I set up the RocksDB configuration and ran my Flink app for several hours.
I expected the job to run normally without any errors, but I found that the job manager stopped receiving heartbeats from the task manager.
When I monitored the memory metrics, the off-heap memory in the task manager kept growing and the job died.
I know that RocksDB is a good choice for storing large state in stream processing, but in my case that did not work out.
For these reasons, I would like to know precisely when Flink flushes data to disk in RocksDB. I would also appreciate any guidance on configuring RocksDB.
In my case, I configured the RocksDB-related settings below; everything else is left at its default value.
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.memory.fixed-per-slot: 924m
state.backend.rocksdb.memory.managed: false
state.checkpoints.dir: s3://nxflow-bucket/checkpoints
state.checkpoints.num-retained: 1
state.savepoints.dir: s3://nxflow-bucket/savepoints
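For reference, the same backend choice can also be made in code; a minimal sketch, assuming the Flink 1.13+ Java API with flink-statebackend-rocksdb on the classpath (the memory-related options above are usually left in flink-conf.yaml):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000); // example interval
env.setStateBackend(new EmbeddedRocksDBStateBackend(true)); // true = incremental checkpoints, as in state.backend.incremental
env.getCheckpointConfig().setCheckpointStorage("s3://nxflow-bucket/checkpoints"); // matches state.checkpoints.dir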
You can see that the off-heap memory keeps growing, as shown in the figure below.
Is there any setting that controls when RocksDB flushes to disk?
Also, is my job really using the disk for the state backend? If so, why did it die?
Thanks.

Related

Does the Flink RocksDB state backend help with restoring state?

I'm considering using RocksDB as the state backend of a Flink job whose state size is up to 1 TB.
My environment:
checkpoint dir: hdfs
flink job submit: yarn-per-job (per-job mode on yarn cluster)
If the job fails, the retry attempts exceed the maximum retry count, and the job dies completely (or I cancel the job), I think the checkpoint and the RocksDB files will be deleted (because I'm deploying the job in per-job mode and the task managers also terminate).
In that case I think I lose all state and have no way to restore it, but I expected that using RocksDB would somehow help restore the state because it is a disk-based state backend. If not, what is the advantage of the RocksDB state backend?
Would retaining the checkpoint on cancellation and restarting the job from the checkpoint (or a savepoint) help in this case?
Thank you
I would recommend checking out https://nightlies.apache.org/flink/flink-docs-master/docs/ops/production_ready/ for an overview of the steps to consider before putting a Flink application into production. Choosing the right state backend is one of them.
What is important for state recovery is that you enable the snapshotting mechanism. That can be either checkpoints or savepoints, which you use together with the configured state backend (like RocksDB). When configured properly, your state will be snapshotted to durable storage, so you can recover from it in case of failures. RocksDB is commonly used for large state that no longer fits in memory.
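As a concrete illustration (a sketch only; the interval and the checkpoint path are placeholders), externalized checkpoints can be retained on cancellation, and a later submission can resume from them with the -s flag:
env.enableCheckpointing(60_000); // take a checkpoint every minute (example value)
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // keep the checkpoint if the job is cancelled
// Later, resume a new submission from the retained checkpoint (or a savepoint):
// ./bin/flink run -s hdfs:///flink/checkpoints/<job-id>/chk-42 my-job.jar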

Flink RocksDB doesn't seem to write to localdir

I have a Flink environment with the following configuration.
state.backend: rocksdb
state.backend.async: true
state.backend.fs.memory-threshold: 1024
state.backend.fs.write-buffer-size: 4096
state.backend.incremental: true
state.backend.local-recovery: false
state.checkpoints.num-retained: 1
state.backend.rocksdb.localdir: /flink_data/rocksdb
state.backend.rocksdb.memory.managed: true
flink_data is a PVC we claim in the OpenShift pod.
We run a job with large state and windows of a couple of hours. From what I understand, RocksDB is better suited for large-state jobs because of its ability to flush some of the data from memory to local storage, as opposed to the default state backend, which works only in memory.
Given the configuration we set, we expected the job using RocksDB to flush data to the local dir.
But when we checked the pod's PVC, I saw that the RocksDB files total only 280 KB.
On the other hand, the task manager has 19 GB of total process memory, of which 11.7 GB is used for Flink managed memory (which, according to the metrics in the Flink UI, seems to always be at 100% usage).
In addition, we see from the RocksDB metrics that there are lots of keys in the DB.
All the files that have been created are exactly 4 KB each.
Thanks for the help in advance.
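One way to double-check whether state is actually landing in RocksDB on that PVC is to enable some of RocksDB's native metrics; a hedged sketch (these options exist in recent Flink versions and add a little overhead):
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.size-all-mem-tables: true
state.backend.rocksdb.metrics.total-sst-files-size: true
Roughly speaking, size-all-mem-tables reflects what is still buffered in memory, while total-sst-files-size reflects what has been flushed to SST files in the local dir.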

Flink task managers are not processing data after restart

I am new to Flink and I deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JM and 6 TMs. I am sending messages of size 4.4 KB, 200k messages every 10 minutes, to an Event Hub topic and performing load testing. I added a restart strategy and checkpointing as below, and I am not explicitly using any state in my code as there is no requirement for it.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with the network buffers, so I followed this link https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate on Flink network and heap memory optimizations, applied the settings below, and everything started working fine.
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the following questions:
If any task manager restarts because of a failure, the task manager restarts successfully and registers with the job manager, but after that the restarted task manager doesn't process any data; it just sits idle. Is this normal Flink behavior, or do I need to add a setting to make the task manager start processing again?
Sorry, and correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What happens if my Flink job does not successfully overcome the task failure? Will the entire Flink job remain in an idle state so that I have to restart it manually, or is there a mechanism I can add to restart my job even after it has crossed the limit of restart attempts?
Is there any documentation on calculating the number of cores and the amount of memory I should assign to a Flink job cluster based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in the job manager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance; please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
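For illustration only (the values are examples, not recommendations), a version of that block where the comments and the values agree, and without setPreferCheckpointForRecovery:
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE); // start a checkpoint every 60 s
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000); // make sure 30 s of progress happen between checkpoints
env.getCheckpointConfig().setCheckpointTimeout(120_000); // checkpoints have to complete within two minutes, or are discarded
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1); // allow only one checkpoint in progress at a time
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // retain checkpoints after cancellation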
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
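For comparison, the defaults for those three fractions are roughly the following, which do fit together (illustrative only, not tuned for this job):
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.jvm-overhead.fraction: 0.1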
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
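A rough sketch of that kind of automation, assuming Java 11's HttpClient and the REST endpoint GET /jobs/<jobid> (the address, jobId, and resubmission step are placeholders):
HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://jobmanager:8081/jobs/" + jobId))
        .build();
String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
// The response JSON contains a "state" field for the job.
if (body.contains("\"state\":\"FAILED\"")) {
    // resubmit here, e.g. by shelling out to: flink run -s <latest retained checkpoint> my-job.jar
}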
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are:
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely (see the sketch after these points)
getEventsForPattern can be expensive
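As an example of constraining matches (a sketch; Event and its getType() are hypothetical, and the time bound is arbitrary):
// "A followed by B within 10 minutes"; without within(), a lone "A" would have to be kept in state indefinitely.
Pattern<Event, ?> bounded = Pattern.<Event>begin("start")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "A".equals(e.getType());
            }
        })
        .next("end")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "B".equals(e.getType());
            }
        })
        .within(Time.minutes(10));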

How to make Flink job with huge state finish

We are running a Flink cluster to process terabytes of historic streaming data. The computation has a huge state, for which we use keyed state - Value and Map states with the RocksDB backend. At some point in the job the performance starts degrading and the input and output rates drop to almost 0. At that point exceptions like 'Communication with TaskManager X timeout error' can be seen in the logs, although the job is compromised even before that.
I presume the problem we are facing has to do with RocksDB's disk backend. As the state of the job grows, it needs to access the disk more often, which drags the performance down to 0. We have played with some of the options and have set those that make sense for our particular setup:
We are using the SPINNING_DISK_OPTIMIZED_HIGH_MEM predefined profile, further tuned with optimizeFiltersForHits and some other options, which has somewhat improved performance. However, none of this provides a stable computation, and on a re-run against a bigger data set the job halts again.
What we are looking for is a way to modify the job so that it makes progress at SOME speed even as the input and the state grow. We are running on AWS with limits set to around 15 GB per task manager and no limit on disk space.
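For context, a setup like the one described is typically wired up roughly as follows (a sketch against the pre-1.13 RocksDBStateBackend API; the checkpoint path is a placeholder):
RocksDBStateBackend backend = new RocksDBStateBackend("s3://my-bucket/checkpoints", true); // incremental checkpoints
backend.setPredefinedOptions(PredefinedOptions.SPINNING_DISK_OPTIMIZED_HIGH_MEM);
backend.setRocksDBOptions(new RocksDBOptionsFactory() {
    @Override
    public DBOptions createDBOptions(DBOptions current, Collection<AutoCloseable> handlesToClose) {
        return current; // keep the DB-level options from the predefined profile
    }
    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions current, Collection<AutoCloseable> handlesToClose) {
        return current.setOptimizeFiltersForHits(true); // skip bloom filters on the bottommost level to save memory
    }
});
env.setStateBackend(backend);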
Using SPINNING_DISK_OPTIMIZED_HIGH_MEM costs a lot of off-heap memory for RocksDB's memtables. Seeing as you are running the job with a memory limit of around 15 GB, I think you will run into OOM issues; if you choose the default predefined profile instead, you will face write stall issues or CPU overhead from decompressing the page cache of RocksDB. So I think you should increase the memory limit.
Here are some posts about RocksDB, FYI:
https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB
https://www.ververica.com/blog/manage-rocksdb-memory-size-apache-flink
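The second link describes capping RocksDB with Flink's managed memory; in config terms that is roughly the following (the fraction is a placeholder, not a recommendation for this particular job):
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.6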

Clarification for State Backend

I have been reading the Flink docs and I need a few clarifications. Hopefully someone can help me out here.
State backend - this basically refers to the location where the data for my operations will be stored; for example, if I'm doing an aggregation over a 2 hr window, where will the buffered data be stored? As pointed out in the docs, for large state we should use RocksDB.
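To make that concrete, such a 2 hr window aggregation might look roughly like this (a sketch; events, Event, getKey(), and merge() are hypothetical), and its buffered contents live in whichever state backend is configured:
// Sketch only: events is a DataStream<Event>; Event, getKey(), and merge() are placeholders.
events
    .keyBy(event -> event.getKey())
    .window(TumblingEventTimeWindows.of(Time.hours(2))) // the open window's contents are held in the state backend
    .reduce((a, b) -> a.merge(b));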
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does in-flight data here refer to the incoming data, say from a Kafka stream, that has not yet been checkpointed?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDB, when a checkpoint is created, the entire buffered data is stored on disk. Then, say, when the window is triggered at the end of the 2 hr, will this state that was stored on disk be deserialized and used for the operation?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on a potentially high-throughput stream with very limited resources? Suppose my Kafka stream has a rate of 50k messages/sec; then I could run it on a single core of my EMR cluster, and the tradeoff would be that Flink can't keep up with the incoming rate and lags behind, but given enough disk space it won't hit an OOM error?
When a checkpoint is completed, I assume that the aggregated checkpoint metadata (like the HDFS or S3 path from each TM) from all the TMs will be sent to the JM? In case of a TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for the JM in flink-conf.yaml is jobmanager.heap.size: 1024m.
My confusion here is why the JM needs 1 GB of heap memory. What does a JM handle apart from synchronisation among TMs? How do I actually decide how much memory should be configured for the JM in production?
Can someone verify whether my understanding is correct and point me in the right direction. Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that only is concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.
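For reference, "checkpointing to a distributed filesystem" corresponds to pointing the checkpoint directory at durable storage, so that checkpoint data does not go through the JM heap; a sketch in flink-conf.yaml terms (paths and sizes are placeholders):
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints
jobmanager.heap.size: 512m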
