Flink: job fails when one TaskManager is OOM? - apache-flink

I was running the Flink 1.8 WordCount example job on Kubernetes and noticed the following behavior. Sometimes a TaskManager pod gets OOMKilled and restarted (that is not a concern for now), but then the whole job fails, and the JobManager log shows "The assigned slot XXX was removed".
My question is, why does the whole job fail? Is there a way that I can configure Flink to make the job more tolerant to transient TaskManager failures?

Apache Flink's fault tolerance mechanism is based on periodic checkpoints and can guarantee exactly-once state consistency, i.e., after recovering from a failure, the state is consistent and the same as if the failure never happened (assuming deterministic application logic of course).
In order to achieve this, Flink takes consistent snapshots of the application's state (so-called checkpoints) at regular intervals. In case of a failure, the whole application is reset to the latest completed checkpoint. For that, Flink (until Flink 1.8) always restarts the whole application. A failure is any reason that terminates a worker process, including application failure, JVM OOM, killed container, hardware failure, etc.
In Flink 1.9 (released a week ago, see announcement), Flink adds so-called failover regions (see here), which can reduce the number of restarted tasks. For continuous streaming applications, this only applies if the application does not have a shuffle (keyBy, broadcast, partition, ...) operation. In that case, only the affected pipeline is restarted and all other pipelines continue processing data.
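To illustrate why a shuffle prevents failover regions from helping a streaming job, here is a minimal sketch in plain Java (it is not Flink code). It assumes a simplified model in which a failover region is a connected component of subtasks linked by pipelined data exchanges; the class FailoverRegionSketch and its methods are hypothetical names used only for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model, not Flink's implementation: subtasks are numbered
// 0..n-1, and every pipelined connection merges two subtasks into the
// same failover region (union-find).
final class FailoverRegionSketch {
    private final int[] parent;

    FailoverRegionSketch(int subtasks) {
        parent = new int[subtasks];
        for (int i = 0; i < subtasks; i++) {
            parent[i] = i;
        }
    }

    private int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]]; // path halving
            x = parent[x];
        }
        return x;
    }

    void connect(int a, int b) {
        parent[find(a)] = find(b);
    }

    int regionCount() {
        Set<Integer> roots = new HashSet<>();
        for (int i = 0; i < parent.length; i++) {
            roots.add(find(i));
        }
        return roots.size();
    }

    // Two chained operators with the given parallelism: upstream subtask i
    // has id i, downstream subtask j has id parallelism + j.
    static int regions(int parallelism, boolean shuffle) {
        FailoverRegionSketch graph = new FailoverRegionSketch(2 * parallelism);
        for (int i = 0; i < parallelism; i++) {
            if (shuffle) {
                // keyBy-style exchange: every upstream subtask feeds every downstream subtask
                for (int j = 0; j < parallelism; j++) {
                    graph.connect(i, parallelism + j);
                }
            } else {
                // forward exchange: upstream subtask i feeds only downstream subtask i
                graph.connect(i, parallelism + i);
            }
        }
        return graph.regionCount();
    }

    public static void main(String[] args) {
        System.out.println("forward: " + regions(4, false) + " region(s)");
        System.out.println("keyBy:   " + regions(4, true) + " region(s)");
    }
}
```

In this model, forward connections leave each pipeline in its own region, so a lost slot affects only one pipeline. An all-to-all exchange merges everything into a single region, which is why a job with a keyBy still restarts as a whole.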

Before running Flink jobs you should do capacity planning; otherwise you will run into OOM problems frequently. In a Kubernetes environment, estimate how much memory your job will need and set the resources.limits.memory of your deployment higher than that, and do the same for resources.requests.memory. If resources.requests.memory is much lower than what your job actually uses, your Pod may end up in the Evicted state, which will cause your job to fail as well.

A container in a Pod may fail for a number of reasons, for example because the process in it exited with a non-zero exit code or because the container was killed for exceeding a memory limit.
You can use the Job specification
.spec.template.spec.restartPolicy = "OnFailure"
With this setting the Pod stays in the system and the container is re-run.
For more information, also check the official Job documentation: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/

Related

Flink task managers are not processing data after restart

I am new to Flink and I deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JM and 6 TMs. For load testing I am sending messages of size 4.4k, 200k messages every 10 min, to an eventhub topic. I added a restart strategy and checkpointing as below, and I am not explicitly using any state in my code, as there is no requirement for it.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers. I followed this link on Flink network and heap memory optimizations, https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate, applied the settings below, and everything is working fine:
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the following questions:
If a task manager restarts because of a failure, it restarts successfully and registers with the job manager, but afterwards the restarted task manager does not process any data and sits idle. Is this normal Flink behavior, or do I need to add a setting to make the task manager start processing again?
Correct me if my understanding is wrong: Flink has a restart strategy, and in my code I limited it to 5 restart attempts. What happens if my Flink job does not recover from the task failure within that limit? Will the entire job remain idle so that I have to restart it manually, or is there a mechanism I can add to restart the job even after it has exceeded the restart attempt limit?
Is there any documentation on how to calculate the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in my job manager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance, please help me in resolving my doubts
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
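The point above about the three memory fractions can be checked with simple arithmetic. The sketch below is plain Java, not Flink code, and follows the answer in treating the three fractions as shares of one memory budget. Flink's actual memory model derives them from slightly different totals, so treat this as a rough sanity check. The class name MemoryFractionCheck is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sanity check: fractions that claim shares of (roughly) the same
// memory budget should add up to clearly less than 1.0.
final class MemoryFractionCheck {

    // The three fractions from the configuration in the question.
    static Map<String, Double> questionConfig() {
        Map<String, Double> fractions = new LinkedHashMap<>();
        fractions.put("taskmanager.memory.managed.fraction", 0.7);
        fractions.put("taskmanager.memory.network.fraction", 0.4);
        fractions.put("taskmanager.memory.jvm-overhead.fraction", 0.4);
        return fractions;
    }

    static double total(Map<String, Double> fractions) {
        double sum = 0.0;
        for (double fraction : fractions.values()) {
            sum += fraction;
        }
        return sum;
    }

    // A total of 1.0 or more leaves no room for heap or anything else.
    static boolean isPlausible(Map<String, Double> fractions) {
        return total(fractions) < 1.0;
    }

    public static void main(String[] args) {
        Map<String, Double> fractions = questionConfig();
        System.out.printf("sum = %.1f, plausible = %b%n",
                total(fractions), isPlausible(fractions));
    }
}
```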
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely
getEventsForPattern can be expensive
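As a rough illustration of the "build your own automation on top of Flink's REST API" suggestion above, here is a minimal watchdog sketch. It assumes Java 11+ (for java.net.http), that the JobManager's REST endpoint is reachable at a base URL you supply, and that the job details returned by GET /jobs/<jobid> contain a "state" field. The class JobWatchdogSketch, the crude regex-based parsing, and the decision rule are assumptions made for this example. How the job gets resubmitted is left out on purpose.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical watchdog: asks the REST API for a job's state and decides
// whether an external tool should resubmit the job.
final class JobWatchdogSketch {
    // Crude extraction of the first "state" field; a real tool should use a JSON parser.
    private static final Pattern STATE =
            Pattern.compile("\"state\"\\s*:\\s*\"([A-Z_]+)\"");

    static String extractState(String json) {
        Matcher matcher = STATE.matcher(json);
        return matcher.find() ? matcher.group(1) : null;
    }

    // Once the restart strategy is exhausted the job ends up FAILED,
    // and Flink itself will not try again.
    static boolean needsResubmit(String state) {
        return "FAILED".equals(state);
    }

    static String fetchJobJson(String baseUrl, String jobId) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest
                .newBuilder(URI.create(baseUrl + "/jobs/" + jobId))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: JobWatchdogSketch <rest-base-url> <job-id>");
            return;
        }
        String state = extractState(fetchJobJson(args[0], args[1]));
        System.out.println("state=" + state + ", resubmit=" + needsResubmit(state));
    }
}
```

Running this periodically (for example from a cron job) and triggering your own deployment step when resubmit is true is one possible design; alerting a human instead may be safer if the job keeps failing for the same reason.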

Having consumer issues when RocksDB in flink

I have a job which consumes from RabbitMQ. I was using the FS state backend, but the state sizes became bigger, so I decided to move my state to RocksDB.
The issue is that the job runs fine during the first hours, and even later when traffic gets slower, but when the traffic gets high again the consumer starts to have issues (events piled up as unacked), and these issues are then reflected in the rest of the app.
I have:
4 CPU core
Local disk
16GB RAM
Unix environment
Flink 1.11
Scala version 2.11
1 single job running with few keyedStreams, and around 10 transformations, and sink to Postgres
some configurations
flink.buffer_timeout=50
flink.maxparallelism=4
flink.memory=16
flink.cpu.cores=4
#checkpoints
flink.checkpointing_compression=true
flink.checkpointing_min_pause=30000
flink.checkpointing_timeout=120000
flink.checkpointing_enabled=true
flink.checkpointing_time=60000
flink.max_current_checkpoint=1
#RocksDB configuration
state.backend.rocksdb.localdir=home/username/checkpoints (this is not working don't know why)
state.backend.rocksdb.thread.numfactory=4
state.backend.rocksdb.block.blocksize=16kb
state.backend.rocksdb.block.cache-size=512mb
#rocksdb or heap
state.backend.rocksdb.timer-service.factory=heap (I have test with rocksdb too and is the same)
state.backend.rocksdb.predefined-options=SPINNING_DISK_OPTIMIZED
Let me know if more information is needed.
state.backend.rocksdb.localdir should be an absolute path, not a relative one. And this setting isn't for specifying where checkpoints go (which shouldn't be on the local disk), this setting is for specifying where the working state is kept (which should be on the local disk).
Your job is experiencing backpressure, meaning that some part of the pipeline can't keep up. The most common causes of backpressure are (1) sinks that can't keep up, and (2) inadequate resources (e.g., the parallelism is too low).
You can test if postgres is the problem by running the job with a discarding sink.
Looking at various metrics should give you an idea of what resources might be under-provisioned.
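On the first point above (the relative state.backend.rocksdb.localdir), a quick way to see the problem is to check how the configured string is interpreted as a path. The sketch below is plain Java and only shows how the JVM treats such a path; how Flink itself handles a relative value is not shown here. The class name LocalDirCheck is hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Shows how the JVM interprets the configured directory string.
final class LocalDirCheck {

    static boolean isAbsolute(String dir) {
        return Paths.get(dir).isAbsolute();
    }

    // A relative path is resolved against the working directory of the
    // process, which is rarely the directory you had in mind.
    static Path resolved(String dir) {
        return Paths.get(dir).toAbsolutePath();
    }

    public static void main(String[] args) {
        String configured = "home/username/checkpoints"; // value from the question
        System.out.println("absolute?   " + isAbsolute(configured));
        System.out.println("resolves to " + resolved(configured));
    }
}
```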

Is it possible to recover when a slot has been removed during a Flink streaming

I have a standalone cluster running a Flink streaming job with 1-hour event time windows. After 2-3 hours of running, the job dies with the "org.apache.flink.util.FlinkException: The assigned slot ... was removed" exception.
The job works well when my windows are only 15 minutes long.
How can the job recover after losing a slot?
Is it possible to run the same calculations on multiple slots to prevent this error?
Shall I increase any of the timeouts? if so which one?
A Flink streaming job recovers from failures using checkpoints. If your checkpoints are externalized, for example to S3, you can recover from the most recent checkpoint manually or have Flink do it automatically.
Depending on your upstream message queuing service, you will likely get duplicated messages, so it is good to make your ingestion idempotent.
Also, the slot removed failure can be the symptom of various failures.
underlying hardware
network
memory pressure
What do you see in the log of the task manager that was removed?
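The advice above to make ingestion idempotent can be sketched as follows. This is plain Java with an in-memory map standing in for the real target system; it assumes every event carries a unique id. The class IdempotentSinkSketch is a hypothetical name for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a sink that upserts by event id, so a replayed event
// overwrites the existing entry instead of creating a second one.
final class IdempotentSinkSketch {
    private final Map<Long, String> store = new HashMap<>();

    void write(long eventId, String payload) {
        store.put(eventId, payload); // upsert keyed by the event id
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        IdempotentSinkSketch sink = new IdempotentSinkSketch();
        sink.write(1, "a");
        sink.write(2, "b");
        sink.write(3, "c");
        // After recovery the source replays events 2 and 3.
        sink.write(2, "b");
        sink.write(3, "c");
        sink.write(4, "d");
        System.out.println("rows stored: " + sink.size());
    }
}
```

With a relational database, the same effect is typically achieved with an upsert on a primary key derived from the event.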

Stream Processing: How often should a checkpoint be initiated?

I am setting up an analytics pipeline using Apache Flink to process a stream of IoT data. While attempting to configure the system, I cannot seem to find any sources for how often checkpointing should be initiated? Are there any recommendations or hard-and-fast rules of thumb? e.g. 1 second, 10 seconds, 1 minutes, etc.?
EDIT: Also, is there a way of programmatically configuring the checkpoint interval at runtime?
This depends on two things:
How much data are you willing to reprocess in the case of a failure (the job will restart from the last completed checkpoint)?
How often are you able to checkpoint due to data transfer limits and the duration of the checkpoint itself?
In my experience most users use checkpoint intervals on the order of 10 seconds, but also configure a "min-pause-between-checkpoints" [1].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/checkpointing.html#enabling-and-configuring-checkpointing
One other thing to consider beyond what was already mentioned: if you are depending on a transactional sink for exactly-once semantics, then those transactions will be committed as part of completing each checkpoint. This means that any downstream consumers of those transactions will experience latency that is more-or-less determined by the checkpointing interval of your job.
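The trade-off described above can be made concrete with a back-of-the-envelope calculation. The sketch below is plain Java and uses a rough model: a failure just before a checkpoint completes forces a replay of roughly one interval plus one checkpoint duration of input, and the minimum pause stretches the interval when checkpoints are slow. The class name and the example numbers (1000 events/s, 2 s and 8 s checkpoint durations, 5 s minimum pause) are assumptions for illustration; only the 10 second interval comes from the answer.

```java
// Rough model, not a Flink API: estimates how far apart checkpoints
// really are and how much input may have to be replayed after a failure.
final class CheckpointIntervalSketch {

    // The minimum pause is measured from the end of one checkpoint to the
    // start of the next, so slow checkpoints stretch the effective interval.
    static long effectiveIntervalMs(long intervalMs, long checkpointDurationMs, long minPauseMs) {
        return Math.max(intervalMs, checkpointDurationMs + minPauseMs);
    }

    // Worst case: the failure happens just before the next checkpoint
    // completes, so one interval plus one checkpoint duration is replayed.
    static long worstCaseReplayedEvents(long eventsPerSecond, long effectiveIntervalMs,
                                        long checkpointDurationMs) {
        return eventsPerSecond * (effectiveIntervalMs + checkpointDurationMs) / 1000;
    }

    public static void main(String[] args) {
        long fast = effectiveIntervalMs(10_000, 2_000, 5_000);
        long slow = effectiveIntervalMs(10_000, 8_000, 5_000);
        System.out.println("effective interval (fast checkpoints): " + fast + " ms");
        System.out.println("effective interval (slow checkpoints): " + slow + " ms");
        System.out.println("worst-case replay at 1000 events/s: "
                + worstCaseReplayedEvents(1_000, fast, 2_000) + " events");
    }
}
```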

Flink exactly-once message processing

I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers and I'm using JMeter to load-test it by producing Kafka messages / events which are then processed. The processing job runs on a TaskManager and it usually takes ~15K events/s.
The job has set EXACTLY_ONCE checkpointing and is persisting state and checkpoints to Amazon S3.
If I shut down the TaskManager running the job, it takes a few seconds and then the job is resumed on a different TaskManager. The job mainly logs the event ids, which are consecutive integers (e.g. from 0 to 1200000).
When I check the output on the TaskManager I shut down, the last count is for example 500000; when I check the output of the resumed job on a different TaskManager, it starts with ~400000. This means ~100K duplicated events. This number depends on the speed of the test and can be higher or lower.
Not sure if I'm missing something but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening / extra settings I have to configure to obtain the exactly once?
You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.
Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.
For a more thorough explanation, see this data Artisans blog post: High-throughput, low-latency, and exactly-once stream processing with Apache Flink™, or the Flink docs.
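The difference between "each event is sent once" and "each event affects state once" can be seen in a small simulation. The sketch below is plain Java, not Flink code: it keeps a counter as state, checkpoints the pair (source offset, counter) every few events, fails once, and then restores and replays. The class ExactlyOnceReplaySketch and its parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simulation of checkpoint-and-replay: logged output (a side effect)
// shows duplicates, while the checkpointed state does not.
final class ExactlyOnceReplaySketch {

    static final class Result {
        final long finalCount;       // state: number of events counted
        final List<Integer> emitted; // side effect: every id that was logged

        Result(long finalCount, List<Integer> emitted) {
            this.finalCount = finalCount;
            this.emitted = emitted;
        }
    }

    // Processes event ids 0..total-1 and fails once when the source
    // reaches offset failAt.
    static Result run(int total, int checkpointEvery, int failAt) {
        List<Integer> log = new ArrayList<>();
        int offset = 0;             // live source position
        long count = 0;             // live state
        int checkpointedOffset = 0; // last completed checkpoint
        long checkpointedCount = 0;
        boolean failed = false;

        while (offset < total) {
            if (!failed && offset == failAt) {
                // Failure: state and source position roll back together.
                offset = checkpointedOffset;
                count = checkpointedCount;
                failed = true;
                continue;
            }
            log.add(offset); // logging is not covered by the checkpoint
            count++;
            offset++;
            if (offset % checkpointEvery == 0) {
                checkpointedOffset = offset;
                checkpointedCount = count;
            }
        }
        return new Result(count, log);
    }

    public static void main(String[] args) {
        Result result = run(12, 4, 10);
        System.out.println("events logged: " + result.emitted.size());
        System.out.println("final count:   " + result.finalCount);
    }
}
```

In this run events 8 and 9 are logged twice, yet the counter ends at the true number of events. By the same reasoning, the roughly 100K repeated ids in the question at about 15K events/s would correspond to several seconds of input since the last completed checkpoint.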
