I've set up a Flink 1.2 standalone cluster with 2 JobManagers and 3 TaskManagers, and I'm using JMeter to load-test it by producing Kafka messages/events which are then processed. The processing job runs on a TaskManager, and it usually handles ~15K events/s.
The job has EXACTLY_ONCE checkpointing enabled and persists state and checkpoints to Amazon S3.
If I shut down the TaskManager running the job, it takes a few seconds, then the job is resumed on a different TaskManager. The job mainly logs the event IDs, which are consecutive integers (e.g., from 0 to 1200000).
When I check the output on the TaskManager I shut down, the last count is, for example, 500000; the output of the resumed job on the other TaskManager then starts at ~400000. That means ~100K duplicated events. The number depends on the speed of the test and can be higher or lower.
Not sure if I'm missing something, but I would expect the job to display the next consecutive number (like 500001) after resuming on the different TaskManager.
Does anyone know why this is happening, or what extra settings I have to configure to obtain exactly-once?
You are seeing the expected behavior for exactly-once. Flink implements fault-tolerance via a combination of checkpointing and replay in the case of failures. The guarantee is not that each event will be sent into the pipeline exactly once, but rather that each event will affect your pipeline's state exactly once.
Checkpointing creates a consistent snapshot across the entire cluster. During recovery, operator state is restored and the sources are replayed from the most recent checkpoint.
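For illustration, enabling this is mostly a matter of checkpoint configuration; a minimal sketch (the interval and S3 path here are assumptions, not your actual settings):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// snapshot the whole pipeline every 10 s, with exactly-once guarantees for state
env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE);
// persist checkpoints to S3 (hypothetical bucket)
env.setStateBackend(new FsStateBackend("s3://my-bucket/checkpoints"));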
For a more thorough explanation, see this data Artisans blog post: "High-throughput, low-latency, and exactly-once stream processing with Apache Flink", or the Flink docs.
I am new to Flink and I deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JM and 6 TMs. I am sending messages of size 4.4 KB, 200k messages every 10 minutes, to an Event Hub topic and performing load testing. I added a restart strategy and checkpointing as below, and I am not explicitly using any state in my code, as there is no requirement for it:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers, and I followed this link https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate for Flink network and heap memory optimizations. I applied the settings below and everything is working fine:
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the below questions:
If any TaskManager restarts because of a failure, it restarts successfully and gets registered with the JobManager, but afterwards the restarted TaskManager doesn't perform any processing of data; it sits idle. Is this normal Flink behavior, or do I need to add any setting to make the TaskManager start processing again?
Sorry, and correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What will happen if my Flink job does not successfully overcome the task failure? Will the entire Flink job remain in an idle state so that I have to restart it manually, or is there a mechanism I can add to restart my job even after it has crossed the limit of restart attempts?
Is there any document explaining how to calculate the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in my JobManager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance; please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
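For reference, a version in which the comments agree with the values would look like this (a sketch; these are the documentation's example values, not recommendations):
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000, CheckpointingMode.EXACTLY_ONCE);
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(60000);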
Some of your entries in flink-conf.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
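For reference, the defaults for those three fractions are much smaller and sum to well under 1.0:
taskmanager.memory.managed.fraction: 0.4
taskmanager.memory.network.fraction: 0.1
taskmanager.memory.jvm-overhead.fraction: 0.1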
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
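As a sketch of such automation (the host, port, and resubmission step are assumptions, not a built-in mechanism), an external watchdog could poll the job state via REST and trigger a resubmission on terminal failure:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkJobWatchdog {
    public static void main(String[] args) throws Exception {
        String jobId = args[0];
        HttpClient client = HttpClient.newHttpClient();
        // GET /jobs/<jobid> returns job details, including a "state" field
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://jobmanager:8081/jobs/" + jobId)) // hypothetical host
                .GET()
                .build();
        String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        // crude string check; a real watchdog would parse the JSON properly
        if (body.contains("\"state\":\"FAILED\"")) {
            // resubmit externally, e.g. `flink run -s <latest-checkpoint> job.jar`
            System.out.println("Job " + jobId + " failed; trigger resubmission here.");
        }
    }
}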
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are:
do everything you can to constrain the matches; avoid patterns that must keep state indefinitely (see the sketch below)
getEventsForPattern can be expensive
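To illustrate the first point: bounding a pattern with within() lets the CEP operator discard stale partial matches instead of keeping them forever. A sketch, assuming a hypothetical Event POJO with a type field:
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

Pattern<Event, ?> pattern = Pattern.<Event>begin("a")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "A".equals(e.type);
            }
        })
        .followedBy("b")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "B".equals(e.type);
            }
        })
        // without this bound, partial matches could accumulate in state indefinitely
        .within(Time.minutes(10));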
Hi everyone, please help me.
I wrote an Apache Flink streaming job which reads JSON messages from Apache Kafka (500-1000 messages per second), deserializes them into POJOs, and performs some operations (filter-keyby-process-sink). I use the RocksDB state backend with exactly-once semantics. But I do not understand which checkpointing interval I should set.
On some forums, people mostly write 1000 or 5000 ms.
I tried setting the interval to 10 ms, 100 ms, 500 ms, 1000 ms, and 5000 ms, and I have not noticed any differences.
Two factors argue in favor of a reasonably small checkpoint interval:
(1) If you are using a sink that does two-phase transactional commits, such as Kafka or the StreamingFileSink, then those transactions will only be committed during checkpointing. Thus any downstream consumers of the output of your job will experience latency that is governed by the checkpoint interval.
Note that you will not experience this delay with Kafka unless you have taken all of the steps required to have exactly-once semantics, end-to-end. This means that you must set Semantic.EXACTLY_ONCE in the Kafka producer, and set the isolation.level in downstream consumers to read_committed. And if you are doing this, you should also increase transaction.max.timeout.ms beyond the default (which is 15 minutes). See the docs for more.
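A sketch of the producer side (the broker, topic, and timeout are assumptions; this uses the FlinkKafkaProducer from the universal Kafka connector):
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.setProperty("bootstrap.servers", "broker:9092"); // hypothetical broker
// must not exceed the broker's transaction.max.timeout.ms
props.setProperty("transaction.timeout.ms", "900000");

FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<>(
        "output-topic", // hypothetical topic
        (KafkaSerializationSchema<String>) (element, timestamp) ->
                new ProducerRecord<>("output-topic", element.getBytes(StandardCharsets.UTF_8)),
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE); // transactions commit when checkpoints complete
// downstream consumers must set isolation.level=read_committed to hide uncommitted records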
(2) If your job fails and needs to recover from a checkpoint, the inputs will be rewound to the offsets recorded in the checkpoint, and processing will resume from there. If the checkpoint interval is very long (e.g., 30 minutes), then your job may take quite a while to catch back up to the point where it is once again processing events in near real-time (assuming you are processing live data).
On the other hand, checkpointing does add some overhead, so doing it more often than necessary has an impact on performance.
In addition to the points described by David, my suggestion is also to use the following method to configure the minimum pause between checkpoints:
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds);
This way, you guarantee that your job will be able to make some progress in case the state gets bigger than planned or the storage where the checkpoints are written is slow.
I recommend reading the Flink documentation on Tuning Checkpointing to better understand these scenarios.
I have a standalone cluster running a Flink streaming job with 1-hour event time windows. After 2-3 hours of running, the job dies with an "org.apache.flink.util.FlinkException: The assigned slot ... was removed" exception.
The job works fine when my windows are only 15 minutes.
How can the job recover after losing a slot?
Is it possible to run the same calculations on multiple slots to prevent this error?
Shall I increase any of the timeouts? If so, which one?
A Flink streaming job recovers from failures using checkpoints. If your checkpoints are externalized, for example to S3, you can restore from the most recent checkpoint manually, or let Flink recover from it automatically.
Depending on your upstream message queuing service, you will likely get duplicated messages after recovery, so it's good to make your ingestion idempotent.
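For example, if your events carry unique IDs, a small keyed function can drop replayed duplicates (a sketch; the Event type and its id field are hypothetical, and note that this state grows with the key space):
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class DedupeById extends RichFlatMapFunction<Event, Event> {
    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        if (seen.value() == null) { // first occurrence of this key
            seen.update(true);
            out.collect(event);
        } // otherwise it is a replayed duplicate and is silently dropped
    }
}

// usage: stream.keyBy(e -> e.id).flatMap(new DedupeById())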
Also, the slot-removed failure can be a symptom of various underlying problems:
underlying hardware
network
memory pressure
What do you see in the log of the TaskManager that was removed?
I am setting up an analytics pipeline using Apache Flink to process a stream of IoT data. While attempting to configure the system, I cannot seem to find any sources on how often checkpointing should be initiated. Are there any recommendations or hard-and-fast rules of thumb, e.g. 1 second, 10 seconds, 1 minute, etc.?
EDIT: Also, is there a way of programmatically configuring the checkpoint interval at runtime?
This depends on two things:
How much data are you willing to reprocess in the case of failure (the job will restart from the last completed checkpoint)?
How often are you able to checkpoint, given data transfer limits and the duration of the checkpoint itself?
In my experience most users use checkpoint intervals in the order of 10 seconds, but also configure a "min-pause-between-checkpoints" [1].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/checkpointing.html#enabling-and-configuring-checkpointing
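Put together, such a setup might look like this (10 s and 5 s are illustrative values, not recommendations for your workload):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// take a checkpoint every 10 seconds...
env.enableCheckpointing(10000L);
// ...but always leave at least 5 seconds of checkpoint-free progress in between
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5000L);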
One other thing to consider beyond what was already mentioned: if you are depending on a transactional sink for exactly-once semantics, then those transactions will be committed as part of completing each checkpoint. This means that any downstream consumers of those transactions will experience latency that is more-or-less determined by the checkpointing interval of your job.
I was running the Flink 1.8 WordCount example job on Kubernetes and noticed a behavior: sometimes a TaskManager pod gets OOMKilled and restarted (that is not a concern for now), but then the whole job fails, and the JobManager log shows "The assigned slot XXX was removed".
My question is: why does the whole job fail? Is there a way I can configure Flink to make the job more tolerant to transient TaskManager failures?
Apache Flink's fault tolerance mechanism is based on periodic checkpoints and can guarantee exactly-once state consistency, i.e., after recovering from a failure, the state is consistent and the same as if the failure never happened (assuming deterministic application logic of course).
In order to achieve this, Flink takes consistent snapshots of the application's state (so-called checkpoints) at regular intervals. In case of a failure, the whole application is reset to the latest completed checkpoint. For that, Flink (until Flink 1.8) always restarts the whole application. A failure is anything that terminates a worker process, including an application failure, a JVM OOM, a killed container, hardware failure, etc.
In Flink 1.9 (released a week ago, see the announcement), Flink adds so-called failover regions (see here), which can reduce the number of restarted tasks. For continuous streaming applications, this only applies if the application does not have a shuffle (keyBy, broadcast, partition, ...) operation. In that case, only the affected pipeline is restarted, and all other pipelines continue processing data.
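In Flink 1.9, the failover strategy is selected in flink-conf.yaml, e.g.:
jobmanager.execution.failover-strategy: region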
When running Flink jobs, you should do capacity planning beforehand; otherwise you will run into OOM problems frequently. In a Kubernetes environment you should calculate how much memory your job will use and set the resources.limits.memory of your deployment higher than that, as well as the resources.requests.memory. If resources.requests.memory is much lower than what your job actually uses, the Pod will fall into the Evicted state, which will cause your job to fail as well.
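A sketch of what that might look like in the pod spec (the sizes are hypothetical; derive them from your own measurements):
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "3Gi"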
A container in a Pod may fail for a number of reasons, such as the process in it exiting with a non-zero exit code, or the container being killed for exceeding a memory limit.
You can use the Job specification:
.spec.template.spec.restartPolicy = "OnFailure"
With this, the Pod will stay in the system and the container will be re-run.
For more information, also check the official Job documentation: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/