Flink production session cluster in EKS instance failure and recovery - apache-flink

I am a newbie in Flink, planning to deploy a Flink session cluster on EKS with 1 job manager and 5 task managers (each task manager with 4 slots). Different jobs will be submitted through the UI for different use cases.
Let's say I have submitted a stateful job (the job has simple counter logic using a RichFlatMapFunction) backed by a RocksDBStateBackend, with checkpointDataUri pointing to S3 and DbStoragePath pointing to a local file path. This job uses 8 slots in total, spread across two task managers, and has been running fine without any issues for a day. Now my questions are as follows:
1) My understanding of checkpointDataUri and DbStoragePath in RocksDBStateBackend is that checkpointDataUri stores the processed offset information in S3 (since I configured checkpointDataUri with an S3 prefix) and DbStoragePath contains all the state information used in the RichFlatMapFunction. So all the stateful information is stored in DbStoragePath, which is only available locally. Please correct me if this is wrong.
2) Let's say my EC2 instance was restarted (the one where the 4 slots were being used) for some reason and it took around 30 minutes to come back online. In this case, EKS will bring up a new EC2 instance as a TaskManager to match the replicas, but will the Flink job manager try to reschedule the 4 slots to a different task manager in the meantime? If yes, how can the state that was stored on the EC2 instance's local disk be recovered?
3) Is there any documentation or video on Flink failure recovery in EKS? I saw the official documentation that explains how to deploy a Flink session cluster on EKS, but I can't find anything related to failure recovery in EKS mode. Could someone please point me in the right direction?

All of the state you are concerned about, namely the processed offsets and the state used in the RichFlatMapFunction (and any other state Flink is managing for your job), is stored both on the local disk (DbStoragePath) and in S3 (checkpointDataUri).
Flink always keeps a working copy of all of the state local to each task manager (for high throughput and low latency), and in the background makes complete copies of this state to a distributed file system (like S3) for reliability.
In other words, what you said in point (1) of your question was incorrect. And the answer to point (2) is that the state to be recovered can always be recovered from S3 if it's not available locally. As for point (3), there's nothing special about failure recovery on EKS compared to any other Flink deployment model.
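For reference, here is a minimal sketch of the setup being described (the bucket and paths are placeholders, and this assumes the Flink 1.11-era RocksDBStateBackend API used elsewhere in this thread): durable checkpoints go to S3, while the working copy of the state lives on the local disk.
import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// checkpointDataUri: where complete (here incremental) copies of the state are written for reliability
RocksDBStateBackend backend = new RocksDBStateBackend("s3://my-bucket/flink-checkpoints", true);
// DbStoragePath: fast local disk holding the RocksDB working state
backend.setDbStoragePath("file:///mnt/flink/rocksdb");
env.setStateBackend(backend);
// copy the state to S3 once a minute
env.enableCheckpointing(60_000);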

Related

Flink task managers are not processing data after restart

I am new to Flink and I have deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JM and 6 TMs. I am sending messages of size 4.4 KB, 200k messages every 10 minutes, to an Event Hub topic and performing load testing. I added a restart strategy and checkpointing as below, and I am not explicitly using any state in my code as there is no requirement for it:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers, so I followed this link https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate on Flink network and heap memory optimizations, applied the settings below, and everything is working fine.
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the below questions:
If any task manager restarts because of a failure, it restarts successfully and gets registered with the job manager, but afterwards the restarted task manager doesn't process any data; it just sits idle. Is this normal Flink behavior, or do I need to add some setting to make the task manager start processing again?
Sorry, and correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What will happen if my Flink job does not successfully overcome the task failure? Will the entire Flink job remain in an idle state so that I have to restart the job manually, or is there a mechanism I can add to restart my job even after it has crossed the limit of restart attempts?
Is there any document on calculating the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in the job manager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance. Please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
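For illustration only, here is a minimal sketch of that kind of automation; it just polls the REST API's /jobs/overview endpoint and flags FAILED jobs (the job manager host and port are placeholders, and the actual resubmission step, e.g. uploading a jar and running it, is left out):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JobWatchdog {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://jobmanager:8081/jobs/overview"))
                .GET()
                .build();
        while (true) {
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // crude check; a real watchdog would parse the JSON properly
            if (body.contains("\"state\":\"FAILED\"")) {
                System.out.println("At least one job has FAILED -- trigger resubmission here");
            }
            Thread.sleep(60_000); // poll once a minute
        }
    }
}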
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely (see the sketch after this list)
getEventsForPattern can be expensive
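To illustrate the first point (this sketch is not from the CEP docs; Event and its getType() method are placeholder types), bounding a pattern in time lets Flink discard stale partial matches instead of keeping them forever:
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

Pattern<Event, ?> pattern = Pattern.<Event>begin("first")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "A".equals(e.getType());
            }
        })
        .followedBy("second")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "B".equals(e.getType());
            }
        })
        // without this, partial matches for "A" would have to be kept indefinitely
        .within(Time.minutes(10));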

Having consumer issues with RocksDB in Flink

I have a job which consumes from RabbitMQ. I was using the FS state backend, but the state sizes grew larger, so I decided to move my state to RocksDB.
The issue is that the job runs fine during the first hours, and even longer if the traffic slows down, but when the traffic gets high again the consumer starts to have issues (events pile up as unacked) and these issues are reflected in the rest of the app.
I have:
4 CPU core
Local disk
16GB RAM
Unix environment
Flink 1.11
Scala version 2.11
1 single job running with a few keyedStreams, around 10 transformations, and a sink to Postgres
Some configurations:
flink.buffer_timeout=50
flink.maxparallelism=4
flink.memory=16
flink.cpu.cores=4
#checkpoints
flink.checkpointing_compression=true
flink.checkpointing_min_pause=30000
flink.checkpointing_timeout=120000
flink.checkpointing_enabled=true
flink.checkpointing_time=60000
flink.max_current_checkpoint=1
#RocksDB configuration
state.backend.rocksdb.localdir=home/username/checkpoints (this is not working don't know why)
state.backend.rocksdb.thread.numfactory=4
state.backend.rocksdb.block.blocksize=16kb
state.backend.rocksdb.block.cache-size=512mb
#rocksdb or heap
state.backend.rocksdb.timer-service.factory=heap (I have test with rocksdb too and is the same)
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED
Let me know if more information is needed.
state.backend.rocksdb.localdir should be an absolute path, not a relative one. Also, this setting isn't for specifying where checkpoints go (which shouldn't be on the local disk); it is for specifying where the working state is kept (which should be on the local disk).
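For example (the paths and bucket are placeholders, not a recommendation), the two settings might look like this in flink-conf.yaml:
state.backend: rocksdb
state.backend.rocksdb.localdir: /home/username/rocksdb-working-dir   # absolute local path for the working state
state.checkpoints.dir: s3://my-bucket/flink-checkpoints              # checkpoints go to durable storage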
Your job is experiencing backpressure, meaning that some part of the pipeline can't keep up. The most common causes of backpressure are (1) sinks that can't keep up, and (2) inadequate resources (e.g., the parallelism is too low).
You can test if postgres is the problem by running the job with a discarding sink.
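A quick sketch of that test (resultStream here is a placeholder for whatever stream currently feeds your Postgres sink):
import org.apache.flink.streaming.api.functions.sink.DiscardingSink;

// temporarily replace the Postgres sink with one that drops every record;
// if the backpressure disappears, the sink (or the database behind it) is the bottleneck
resultStream.addSink(new DiscardingSink<>());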
Looking at various metrics should give you an idea of what resources might be under-provisioned.

Flink: local dir for state.checkpoints.dir

I was trying to understand the implications of using a local dir, e.g. file:///checkpoints/, for state.checkpoints.dir. My confusion is: 1) there might be multiple TaskManagers; does that mean each will save its own checkpoints to its local disk? 2) Does this work in an environment like Kubernetes, where Pods might be moved around in the cluster?
This won't work. state.checkpoints.dir must be a URI that is accessible to every machine in the cluster, i.e., some sort of distributed filesystem. This is necessary for recovery in situations in which a task manager has failed, or when state needs to be redistributed for rescaling.
You may also want each TaskManager to additionally keep a copy of its state locally for faster recovery; see Task Local Recovery for info on that option.
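If you do enable it, a minimal sketch in flink-conf.yaml would be (the root directory is a placeholder and optional):
state.backend.local-recovery: true
taskmanager.state.local.root-dirs: /mnt/flink/local-recovery   # optional; defaults to the TaskManager's temp directories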

Clarification for State Backend

I have been reading the Flink docs and I need a few clarifications. Hopefully someone can help me out here.
State Backend - This basically refers to the location where the data for my operations will be stored; for example, if I'm doing an aggregation over a 2 hr window, this is where the buffered data will be stored. As pointed out in the docs, for large state we should use RocksDB.
The RocksDBStateBackend holds in-flight data in a RocksDB database that is (per default) stored in the TaskManager data directories
Does in-flight data here refer to incoming data, say from a Kafka stream, that has not yet been checkpointed?
Upon checkpointing, the whole RocksDB database will be checkpointed into the configured file system and directory. Minimal metadata is stored in the JobManager’s memory
When using RocksDB, when a checkpoint is created, all the buffered data is stored on disk. Then, say, when the window is triggered at the end of the 2 hrs, will this state that was stored on disk be deserialised and used for the operation?
Note that the amount of state that you can keep is only limited by the amount of disk space available
Does this mean that I could run an analytical query on a potentially high-throughput stream with very limited resources? Suppose my Kafka stream has a rate of 50k messages/sec; could I then run it on a single core of my EMR cluster, with the tradeoff that Flink won't be able to catch up with the incoming rate and will lag, but given enough disk space it won't hit an OOM error?
When a checkpoint is completed, I assume that the aggregated checkpoint metadata (like the HDFS or S3 path from each TM) from all the TMs will be sent to the JM? In case of a TM failure, the JM will spin up a new JM and restore the state from the last checkpoint.
The default setting for JM in flink-conf.yaml - jobmanager.heap.size: 1024m.
My confusion here is why the JM needs 1 GB of heap memory. What does a JM handle apart from synchronisation among the TMs? How do I actually decide how much memory should be configured for the JM in production?
Can someone verify whether my understanding is correct and point me in the right direction? Thanks in advance!
Overall your understanding appears to be correct. One point: in the case of a TM failure, the JM will spin up a new TM and restore the state from the last checkpoint (rather than spinning up a new JM).
But to be a bit more precise, in the last few releases of Flink, what used to be a monolithic job manager has been refactored into separate components: a dispatcher that receives jobs from clients and starts new job managers as needed; a job manager that is only concerned with providing services to a single job; and a resource manager that starts up new TMs as needed. The resource manager is the only component that is cluster-framework specific -- e.g., there is a YARN resource manager.
The job manager has other roles as well -- it is the checkpoint coordinator and the API endpoint for the web UI and metrics.
How much heap the JM needs is somewhat variable. The defaults were chosen to try to cover more than a narrow set of situations, and to work out of the box. Also, by default, checkpoints go to the JM heap, so it needs some space for that. If you have a small cluster and are checkpointing to a distributed filesystem, you should be able to get by with less than 1GB.
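As a rough illustration (the URI is a placeholder), a small cluster that keeps checkpoints off the JM heap might use something like:
state.backend: filesystem                          # or rocksdb
state.checkpoints.dir: hdfs:///flink/checkpoints   # any filesystem that all TMs and the JM can reach (s3://, hdfs://, ...)
jobmanager.heap.size: 1024m                        # the default; often reducible once checkpoints no longer land on the JM heap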

What would happen if I configured a local file system for Flink checkpointing?

I have seen a video named Managing State in Apache Flink - Tzu-Li (Gordon) Tai.
In this video, the data is stored on a distributed file system.
I'm wondering what would happen if I configured a local file system for Flink checkpointing?
e.g.:
env.setStateBackend(new RocksDBStateBackend(getString("file:///tmp/checkpoints"), true));
I assume that every node of the Flink cluster will keep its own data. Would it work well?
I assume that every node of the Flink cluster will keep its own data.
That is correct.
Would it work well?
With a local file system and distributed nodes you may be able to checkpoint just fine (even that is not certain, as the directory may be created by the JobManager, so the TaskManager instances could fail because the directory does not exist); however, you would not be able to restore, as the JobManager reads the checkpoint data and distributes it out to the operators as needed.
Strictly speaking, it does not matter to Flink whether the file system is local or distributed. What is important is that the JobManager, at restore time, is able to see all of the checkpoint data. If you are running with everything on the same machine, then a local file system would work just fine.
I think in principle you could even have all nodes write locally and then manually use a synchronization process to move the data to somewhere the JobManager could see it during an attempted restore; however, that is certainly not a recommended approach.
