Jobs stuck while trying to restart from a checkpoint - apache-flink

Context
We are using Flink to run a number of streaming jobs that read from Kafka, perform some SQL transformation and write the output to Kafka. It runs on Kubernetes with two jobmanagers and many taskmanagers. Our jobs use checkpointing with RocksDB and our checkpoints are written on a bucket in AWS S3.
Recently, we upgraded from Flink 1.13.1 to Flink 1.15.2. We used the savepoint mechanism to stop our jobs and restart them on the new version. We have two Kubernetes clusters. Right after the migration, everything seemed fine on both of them. But some time later (almost a month for the first cluster, 2 or 3 days for the second one), we started seeing other problems, which may or may not be related to the migration to Flink 1.15 since they appeared later.
Description of the problem
Recently, we noticed that a few jobs failed to start. We see that the "Source" tasks in the execution graph stay in CREATED while all the others further down the graph (ChangelogNormalize, Writer) are RUNNING. The jobs restart regularly with the error (stacktrace simplified for readability):
java.lang.Exception: Cannot deploy task Source: source_consumer -> *anonymous_datastream_source$81*[211] (1/8) (de8f109e944dfa92d35cdc3f79f41e6f) - TaskManager (<address>) not responding after a rpcTimeout of 10000 ms
at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$5(Execution.java:602)
...
Caused by: java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(TaskExecutorGateway.submitTask(TaskDeploymentDescriptor, JobMasterId, Time))] at recipient [akka.tcp://flink#<address>/user/rpc/taskmanager_0] timed out. This is usually caused by: 1) Akka failed sending the message silently, due to problems like oversized payload or serialization failures. In that case, you should find detailed error information in the logs. 2) The recipient needs more time for responding, due to problems like slow machines or network jitters. In that case, you can try to increase akka.ask.timeout.
at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.submitTask(RpcTaskManagerGateway.java:60)
at org.apache.flink.runtime.executiongraph.Execution.lambda$deploy$4(Execution.java:580)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#<address>/user/rpc/taskmanager_0#1723317240]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
We also noticed this message in the jobmanager logs:
Discarding oversized payload sent to Actor[akka.tcp://flink#<address>/user/rpc/taskmanager_0#1153219611]: max allowed size 10485760b bytes, actual size of encoded class org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation was 83938405 bytes.
It is not clear why such big Akka messages are sent. But when setting akka.framesize to a higher value (100MB), the timeout indeed disappears, and the tasks that were stuck in CREATED are now INITIALIZING.
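The two sizes in the log line above show why raising the limit helps. As a minimal sketch (the class and method names are hypothetical, and this is neither Flink nor Akka code), the check below parses a size in the `<bytes>b` form shown in the log and compares it with the payload size. The value 104857600b is an assumed byte value for the 100MB setting.

```java
final class FrameSizeCheck {
    /** Parses a size written as "<bytes>b", e.g. "10485760b" (the form seen in the log line). */
    static long parseBytes(String size) {
        String s = size.trim();
        if (s.endsWith("b") || s.endsWith("B")) {
            s = s.substring(0, s.length() - 1);
        }
        return Long.parseLong(s.trim());
    }

    /** True if a payload of the given size is within the configured frame size. */
    static boolean fits(long payloadBytes, String frameSize) {
        return payloadBytes <= parseBytes(frameSize);
    }
}
```

With the values from the log, `FrameSizeCheck.fits(83938405L, "10485760b")` is false, while the same payload fits under `"104857600b"`.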
However, the jobs then stay INITIALIZING for a very long time. Sometimes they do start, sometimes they fail with the error:
java.lang.OutOfMemoryError: Java heap space
Increasing the memory of the taskmanager helped for some jobs but not all. Overall, they seem to require a lot more memory and take a very long time to initialize. Sometimes we have a connection reset from S3:
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore operator state backend
...
Caused by: java.lang.IllegalStateException: Connection pool shut down
New observations (08/02/2023): we discovered that the problematic jobs have a very large _metadata file in their checkpoint (168MB for the largest). Worse, it seems to double in size every time the job is resumed from its checkpoint (the growth shows up in the first checkpoint taken after the restart; the following checkpoints then stay at that size).
Questions
What could cause Akka messages that big when submitting a task?
Did something change between Flink 1.13 and Flink 1.15 that could explain those issues?
How can we determine what is taking all the heap memory?

Though we did not understand everything, we found where the problem came from and managed to fix it.
TL;DR: The topic-partition-offset-states key was kept in the job state (checkpoints) when we switched from FlinkKafkaConsumer to KafkaSource. Though it wasn't used anymore, it grew exponentially, so we removed it from the checkpoints (using custom Java code) and restarted everything.
In detail:
We switched from FlinkKafkaConsumer to KafkaSource. We made sure that the offsets were committed and used setStartingOffsets(OffsetsInitializer.committedOffsets()) when resuming the job from the savepoint (as explained in https://nightlies.apache.org/flink/flink-docs-release-1.14/release-notes/flink-1.14/#deprecate-flinkkafkaconsumer). That did work, and our jobs resumed correctly in Flink 1.15 with correct offsets and a state that seemed good.
However, it looks like the source operators kept the topic-partition-offset-states key in their state. This was used by FlinkKafkaConsumer, but it is not used by KafkaSource.
For some reason (that we could not determine), the offsets in topic-partition-offset-states doubled in length sometimes when our jobs were recovered (we use HA on Kubernetes, so this can happen regularly if we restart Flink).
After some time, this list of offsets became so big that our _metadata files in the checkpoints became very big (168MB). This led to Akka timeouts as they exceeded the akka.framesize. Increasing the framesize helped, but increased the memory pressure, causing many heap memory errors. Besides, it just made the problem worse as the _metadata kept doubling in size beyond 10MB.
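To see how quickly repeated doubling outruns any fixed frame size, here is a small arithmetic sketch. The helper name is hypothetical; the 83938405-byte payload and the 10485760-byte limit come from the jobmanager log line, and 104857600 bytes is an assumed byte value for the 100MB setting.

```java
final class MetadataGrowth {
    /** Number of doublings after which a size starting at initialBytes first exceeds limitBytes. */
    static int doublingsUntilExceeds(long initialBytes, long limitBytes) {
        if (initialBytes <= 0) {
            throw new IllegalArgumentException("initialBytes must be positive");
        }
        int doublings = 0;
        long size = initialBytes;
        while (size <= limitBytes) {
            // multiplyExact throws ArithmeticException instead of silently overflowing
            size = Math.multiplyExact(size, 2L);
            doublings++;
        }
        return doublings;
    }
}
```

Starting from the 83938405-byte payload, a single further doubling gives 167876810 bytes, which is already above the assumed 100MB limit. That is consistent with the observation that raising the frame size only postponed the failure.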
The problem was the same for the completedCheckpoint files in the high availability storage directory.
To fix that, we had to:
Deserialize the CompletedCheckpoint files.
Update them, to remove the topic-partition-offset-states key from the states (making those files much smaller).
Re-serialize them and replace the original files.
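As an illustration only, the sketch below applies the same deserialize, remove, re-serialize cycle to a plain Java map that stands in for the operator state. It does not use Flink's CompletedCheckpoint or state-handle classes, and the class name, method names, and map layout are all hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;

final class StateKeyPruner {
    /** Serializes the stand-in state map with plain Java serialization. */
    static byte[] serialize(HashMap<String, ArrayList<Long>> state) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(state);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the stand-in state map back from its serialized form. */
    @SuppressWarnings("unchecked")
    static HashMap<String, ArrayList<Long>> deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (HashMap<String, ArrayList<Long>>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Deserializes, drops the stale key, and re-serializes the smaller state. */
    static byte[] pruneKey(byte[] data, String staleKey) {
        HashMap<String, ArrayList<Long>> state = deserialize(data);
        state.remove(staleKey);
        return serialize(state);
    }
}
```

Calling `StateKeyPruner.pruneKey(bytes, "topic-partition-offset-states")` on such a stand-in returns a smaller byte array in which the other keys are untouched.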
Upon restart of the taskmanagers and jobmanagers, the jobs loaded correctly. After they wrote their first checkpoint, the _metadata files were back to a reasonable size.

Related

Flink committing to kafka takes longer than the checkpoint interval

I'm having trouble understanding why my Flink job's offset commits to Kafka are taking so long. I have a checkpoint interval of 1s and the following warning appears. I'm currently using version 1.14.
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity
By comparison, in some Kafka Streams applications we have running, the commit latency is around 100 ms.
Can you point me in the right direction? Are there any metrics that I can look at?
I tried to find metrics that could help to debug this.
Since Flink is continually committing offsets (sometimes overlapping in the case of longer-running commits), network-related blips and other external issues that cause the checkpoint to take longer can result in what you are seeing (a subsequent checkpoint completes before the commit for the previous one has succeeded).
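The "skipping commit of previous offsets" behavior can be pictured as a single-slot hand-off in which only the newest offsets are kept. The sketch below is a simplified model of that idea, not Flink's connector code; the class and method names are hypothetical, and a single long stands in for the per-partition offsets.

```java
import java.util.concurrent.atomic.AtomicReference;

final class LatestOffsetsSlot {
    // Holds the offsets of the most recent completed checkpoint that still await a commit.
    private final AtomicReference<Long> next = new AtomicReference<>();

    /**
     * Called when a checkpoint completes. Returns true if older offsets were
     * still waiting and have now been replaced, i.e. their commit is skipped.
     */
    boolean offer(long checkpointOffsets) {
        return next.getAndSet(checkpointOffsets) != null;
    }

    /** Called by the committing thread: takes the pending offsets, or null if none. */
    Long poll() {
        return next.getAndSet(null);
    }
}
```

If `offer` returns true, an earlier set of offsets was still waiting and has been replaced, which corresponds to the situation the warning describes. Only the skipped commit is lost; the newer offsets are still committed.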
There are a handful of useful metrics related to checkpointing that you may want to explore that might help determine what's occurring:
lastCheckpointDuration - The time it took to complete the last checkpoint (in milliseconds).
lastCheckpointSize - The checkpointed size of the last checkpoint (in bytes), this metric could be different from lastCheckpointFullSize if incremental checkpoint or changelog is enabled.
Monitoring these as well as some of the other checkpointing metrics, along with task/job manager logs, might help you piece together a story for what caused the slower commit to take so long.
If you find that you are continually encountering this, you may look at adjusting the checkpointing configuration for the job to tolerate these longer durations.

Flink task managers are not processing data after restart

I am new to Flink and I deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JM and 6 TMs. For load testing, I am sending 200k messages of size 4.4k every 10 min to an Event Hub topic. I added a restart strategy and checkpointing as below. I am not explicitly using any state in my code, as there is no requirement for it.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers. I followed this link on Flink network and heap memory optimizations (https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate), applied the settings below, and everything is working fine:
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the following questions:
If a task manager restarts because of a failure, it restarts successfully and registers with the job manager, but after that the restarted task manager doesn't process any data; it sits idle. Is this normal Flink behavior, or do I need to add a setting to make the task manager start processing again?
Please correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What happens if my Flink job does not recover from the task failure within those attempts? Will the entire job remain idle so that I have to restart it manually, or is there a mechanism I can add to restart my job even after it has exceeded the limit of restart attempts?
Is there any documentation on how to calculate the number of cores and the amount of memory I should assign to a Flink job cluster, based on data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in my job manager logs, before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance, please help me in resolving my doubts
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
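The fraction check mentioned above is simple arithmetic. A minimal sketch with the three values from the posted configuration follows; the class and method names are hypothetical, and Flink's real memory derivation also applies min/max bounds that this ignores.

```java
import java.math.BigDecimal;

final class FractionCheck {
    /** Sums the configured fractions exactly (BigDecimal avoids floating-point rounding). */
    static BigDecimal sum(String... fractions) {
        BigDecimal total = BigDecimal.ZERO;
        for (String fraction : fractions) {
            total = total.add(new BigDecimal(fraction));
        }
        return total;
    }

    /** True if the fractions together claim more than the whole (more than 1.0). */
    static boolean exceedsWhole(String... fractions) {
        return sum(fractions).compareTo(BigDecimal.ONE) > 0;
    }
}
```

For the posted values, `FractionCheck.sum("0.7", "0.4", "0.4")` is 1.5, so `exceedsWhole` reports true.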
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely
getEventsForPattern can be expensive

Is it possible to recover when a slot has been removed during a Flink streaming

I have a standalone cluster where there is a Flink streaming job with 1-hour event time windows. After 2-3 hours of running, the job dies with the "org.apache.flink.util.FlinkException: The assigned slot ... was removed" exception.
The job works well when my windows are only 15 minutes.
How can the job recover after losing a slot?
Is it possible to run the same calculations on multiple slots to prevent this error?
Should I increase any of the timeouts? If so, which one?
A Flink streaming job recovers from failures by restoring from a checkpoint. If your checkpoints are externalized, for example in S3, you can recover from the most recent checkpoint either manually or by having Flink do it automatically.
Depending on your upstream message queuing service, you will likely get duplicated messages, so it's good to make your ingestion idempotent.
Also, the "slot removed" failure can be a symptom of various underlying failures:
underlying hardware
network
memory pressure
What do you see in the log of the task manager whose slot was removed?

Flink: job fails when one TaskManager is OOM?

I was running the Flink 1.8 WordCount example job on Kubernetes and noticed the following behavior: sometimes a TaskManager pod gets OOMKilled and restarted (which is not a concern for now), but the whole job fails, and the JobManager log shows "The assigned slot XXX was removed".
My question is, why does the whole job fail? Is there a way that I can configure Flink to make the job more tolerant to transient TaskManager failures?
Apache Flink's fault tolerance mechanism is based on periodic checkpoints and can guarantee exactly-once state consistency, i.e., after recovering from a failure, the state is consistent and the same as if the failure never happened (assuming deterministic application logic of course).
In order to achieve this, Flink takes consistent snapshots of the application's state (so-called checkpoints) at regular intervals. In case of a failure, the whole application is reset to the latest completed checkpoint. For that, Flink (up to and including Flink 1.8) always restarts the whole application. A failure is any reason that terminates a worker process, including application failure, JVM OOM, killed container, hardware failure, etc.
In Flink 1.9 (released a week ago, see announcement), Flink adds so-called failover regions (see here), which can reduce the number of restarted tasks. For continuous streaming applications, this only applies if the application does not have a shuffle (keyBy, broadcast, partition, ...) operation. In that case, only the affected pipeline is restarted and all other pipelines continue processing data.
Before running Flink jobs you should do capacity planning; otherwise you will run into OOM problems frequently. In a Kubernetes environment, you should estimate how much memory your job will use and set the resources.limits.memory of your deployment higher than that, along with resources.requests.memory. If resources.requests.memory is much lower than what your job actually uses, your Pod may end up in the Evicted state, which will cause your job to fail as well.
A container in a Pod may fail for a number of reasons, for example because the process in it exited with a non-zero exit code, or because the container was killed for exceeding a memory limit.
You can set the following in the Job specification:
.spec.template.spec.restartPolicy = "OnFailure"
With this setting, the Pod stays in place and the container is re-run.
For more information, also check the official Job documentation: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

flink log missing when processing large amount of data

I'm testing the performance of Flink when processing different amounts of data, so I need the Job Runtime to record and analyse.
When I use Flink to process a small dataset, like ten thousand records, I get the Job Runtime log as below.
07/18/2017 17:41:47 DataSink (collect())(1/1) switched to FINISHED
07/18/2017 17:41:47 Job execution switched to status FINISHED.
Program execution finished
Job with JobID 3f7658725aaae8cd3427d2aad921f2ef has finished.
Job Runtime: 1124 ms
Accumulator Results:
- c28953fb854da74d18dc7c168b988ca2 (java.util.ArrayList) [15433 elements]
But when I use Flink to process a slightly larger dataset, like fifty thousand records, I don't get the Job Runtime info, as shown below, and the shell hangs:
07/18/2017 17:49:33 DataSink (collect())(1/1) switched to FINISHED
07/18/2017 17:49:33 Job execution switched to status FINISHED.
Is there any configuration I need to modify?
Why does the shell hang when the dataset is bigger?
Hope someone can answer my doubts. Thanks~
Flink uses Akka for remote communication, and the accumulator results are sent as a single message back to the client. Akka imposes a maximum message size, and you may be hitting the limit. A few suggestions:
Check the JobManager log for error messages related to Akka.
Increase the maximum size via the Flink configuration, e.g. akka.framesize. See Flink documentation for more information.

Resources