Flink Statefun Under Backpressure application crashes - apache-flink

I'm reading data from a Kafka topic which has lots of data. Once Flink starts reading, it starts up fine and then crashes after some time, when backpressure hits 100%, and then goes into an endless cycle of restarts.
My question is: shouldn't Flink's backpressure mechanism come into play and reduce consumption from the topic until the in-flight data is consumed or the backpressure reduces, as stated in this article: https://www.ververica.com/blog/how-flink-handles-backpressure? Or do I need to set some config that I'm missing? Is there any other solution to avoid this restart cycle when backpressure increases?
I've tried these configs:
taskmanager.network.memory.buffer-debloat.enabled: true
taskmanager.network.memory.buffer-debloat.samples: 5
My modules.yaml has this config for the transport:
spec:
  functions: function_name
  urlPathTemplate: http://nginx:8888/function_name
  maxNumBatchRequests: 50
  transport:
    timeouts:
      call: 2 min
      connect: 2 min
      read: 2 min
      write: 2 min

You should look in the logs to determine what exactly is causing the crash and restart, but typically when backpressure is involved in a restart it's because a checkpoint timed out. You could increase the checkpoint timeout.
However, it's better if you can reduce/eliminate the backpressure.
One common cause of backpressure is not providing Flink with enough resources to keep up with the load. If this is happening regularly, you should consider scaling up the parallelism. Or it may be that the egress is under-provisioned.
I see you've already tried buffer debloating (which should help). You can also try enabling unaligned checkpoints.
See https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpointing_under_backpressure/ for more information.
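Concretely, both of those knobs can be set in flink-conf.yaml. A minimal sketch — the 30-minute timeout is an illustrative value, not a recommendation:

```yaml
# flink-conf.yaml -- sketch; values are illustrative
execution.checkpointing.timeout: 30min   # default is 10min
execution.checkpointing.unaligned: true  # let barriers overtake buffered in-flight data
```

Unaligned checkpoints trade checkpoint size for barrier propagation speed, so they help most when backpressure is the thing slowing checkpoints down.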

Related

Flink committing to kafka takes longer than the checkpoint interval

I'm having issues understanding why my Flink job's offset commits to Kafka are taking so long. I have a checkpoint interval of 1s, and the following warning appears. I'm currently using version 1.14.
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity
For comparison, some Kafka Streams applications we have running show a commit latency of around 100 ms.
Can you point me in the right direction? Are there any metrics that I can look at?
I tried to find metrics that could help debug this.
Since Flink is continually committing offsets (sometimes overlapping in the cases of longer-running commits), network related blips and other external issues that cause the checkpoint to take longer can result in what you are seeing (a subsequent checkpoint is completed prior to the success of the previous one).
There are a handful of useful metrics related to checkpointing that you may want to explore that might help determine what's occurring:
lastCheckpointDuration - The time it took to complete the last checkpoint (in milliseconds).
lastCheckpointSize - The checkpointed size of the last checkpoint (in bytes), this metric could be different from lastCheckpointFullSize if incremental checkpoint or changelog is enabled.
Monitoring these as well as some of the other checkpointing metrics, along with task/job manager logs, might help you piece together a story for what caused the slower commit to take so long.
If you find that you are continually encountering this, you may look at adjusting the checkpointing configuration for the job to tolerate these longer durations.
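If you go the route of tolerating longer durations, the relevant settings can be sketched as follows (the key names are from the Flink configuration reference; the values are placeholders to adjust for your job):

```yaml
# flink-conf.yaml -- sketch with placeholder values
execution.checkpointing.timeout: 20min                   # give slow checkpoints time to finish
execution.checkpointing.min-pause: 30s                   # breathing room between checkpoints
execution.checkpointing.tolerable-failed-checkpoints: 3  # don't fail the job on a single timeout
```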

Checkpointing Getting Failed in Flink Streaming Job (Table/SQL API)

My job flow works like this:
Src [Kafka] -> Lookup with MySQL -> Deduplication (using Top-N on proc time) -> Upsert Kafka/MySQL
My job runs fine and data flows perfectly to Kafka and MySQL, but it fails on checkpointing (image attached).
PS: for the time being I have disabled checkpointing, but when I enable it with the same properties it fails.
The checkpoint is failing because it is timing out. The typical cause of checkpoint timeouts is backpressure that prevents the checkpoint barriers from making sufficiently rapid progress across the execution graph. Another possibility is inadequate bandwidth or quota for writing to the checkpoint storage.
Some ideas:
increase the timeout (the default timeout is 10 minutes; yours has been reduced to 2 minutes)
enable unaligned checkpoints (this should lessen the impact of backpressure on checkpoint times)
find the cause of the backpressure and alleviate it (the mysql lookup is an obvious candidate)
examine the parallel subtasks for evidence of asymmetries in checkpoint sizes, alignment times, etc. indicating skew in the processing caused by hot keys, or misaligned watermarks, or other clues
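For a Table/SQL job, the first two of these ideas can be applied from the SQL client before the job is submitted. A sketch, with illustrative values:

```sql
-- Flink SQL client -- sketch; values are illustrative
SET 'execution.checkpointing.timeout' = '30min';   -- default is 10min; yours is 2min
SET 'execution.checkpointing.unaligned' = 'true';  -- lessen the impact of backpressure
```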

Checkpointing issues in Flink 1.10.1 using RocksDB state backend

We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/Windows are forwarded to a SinkFunction that writes messages to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs/8GB. We are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 seconds. Over time, the checkpoint sizes increase, but the times remain a very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink will always be included in this, but the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports a HIGH backpressure
Eventually this situation resolves in one of two ways:
Enough checkpoints fail to trigger the job to fail due to a failed checkpoint proportion threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they take initially (when the state size is more like 30MB vs. 60MB)
We are really at a loss at how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low, we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
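The native metrics are all off by default and are enabled per metric in the configuration. A sketch with a few of the metrics that are useful for diagnosing growing state and compaction pressure (key names are from the Flink configuration reference):

```yaml
# flink-conf.yaml -- enable a selection of RocksDB native metrics
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true
state.backend.rocksdb.metrics.num-running-compactions: true
state.backend.rocksdb.metrics.mem-table-flush-pending: true
```

Note that reporting these metrics has a small performance cost, so you may want to enable them only while investigating.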
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not add this, the default of 1 is used.
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62

Which checkpointing interval (ms) should be set?

Hi everyone, please help me.
I'm writing an Apache Flink streaming job which reads JSON messages from Apache Kafka (500-1000 messages per second), deserializes them into POJOs, and performs some operations (filter-keyBy-process-sink). I'm using the RocksDB state backend with exactly-once semantics. But I do not understand which checkpointing interval I need to set.
On some forums people mostly suggest 1000 or 5000 ms.
I tried setting the interval to 10 ms, 100 ms, 500 ms, 1000 ms, and 5000 ms. I have not noticed any differences.
Two factors argue in favor of a reasonably small checkpoint interval:
(1) If you are using a sink that does two-phase transactional commits, such as Kafka or the StreamingFileSink, then those transactions will only be committed during checkpointing. Thus any downstream consumers of the output of your job will experience latency that is governed by the checkpoint interval.
Note that you will not experience this delay with Kafka unless you have taken all of the steps required to have exactly-once semantics, end-to-end. This means that you must set Semantic.EXACTLY_ONCE in the Kafka producer, and set the isolation.level in downstream consumers to read_committed. And if you are doing this, you should also increase transaction.max.timeout.ms beyond the default (which is 15 minutes). See the docs for more.
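The settings involved live in three different places — the Flink producer, the downstream consumers, and the Kafka brokers. A sketch of the broker and consumer side (these are standard Kafka configuration keys; the timeout value is illustrative):

```properties
# Kafka broker (server.properties): raise the cap on transaction timeouts
# beyond the 15-minute default so it comfortably covers your checkpoint interval.
transaction.max.timeout.ms=3600000

# Downstream consumers: only read records from committed transactions.
isolation.level=read_committed
```

On the Flink side, the producer's own transaction timeout (transaction.timeout.ms in the producer properties) must stay below the broker's cap.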
(2) If your job fails and needs to recover from a checkpoint, the inputs will be rewound to the offsets recorded in the checkpoint, and processing will resume from there. If the checkpoint interval is very long (e.g., 30 minutes), then your job may take quite a while to catch back up to the point where it is once again processing events in near real-time (assuming you are processing live data).
On the other hand, checkpointing does add some overhead, so doing it more often than necessary has an impact on performance.
In addition to the points described by @David, my suggestion is also to use the following method to configure the checkpointing:
StreamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds)
This way, you guarantee that your job will be able to make some progress in case the state gets bigger than planned or the storage where the checkpoints are made is slow.
I recommend reading the Flink documentation on Tuning Checkpointing to better understand these scenarios.
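The interval and the minimum pause can also be expressed together in flink-conf.yaml rather than in code. A sketch with illustrative values:

```yaml
# flink-conf.yaml -- sketch; values are illustrative
execution.checkpointing.interval: 1s
execution.checkpointing.min-pause: 500ms
```

With this pair, a new checkpoint is never triggered until at least 500 ms after the previous one completed, regardless of how long that checkpoint took.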

Flink checkpoints causes backpressure

I have a Flink job processing data at around 200k qps. Without checkpoints, the job is running fine.
But when I tried to add checkpoints (with an interval of 50 mins), it causes backpressure at the first task, which adds a key field to each entry, and the data lag goes up constantly as well.
(Image: the lag of my two Kafka topics. In the first half, checkpoints were enabled and the lag goes up very quickly; in the second half, with checkpoints disabled, the lag stays within milliseconds.)
I am using at-least-once checkpoint mode, which should be an asynchronous process. Could anyone suggest anything?
My checkpointing settings:
env.enableCheckpointing(1800000, CheckpointingMode.AT_LEAST_ONCE);
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.getCheckpointConfig().setCheckpointTimeout(600000); // 10 minutes
env.getCheckpointConfig().setFailOnCheckpointingErrors(
        jobConfiguration.getCheckpointConfig().getFailOnCheckpointingErrors());
My job has 128 containers.
With a 10-minute checkpoint interval, the stats were as follows (image). I am now trying a 30-minute checkpoint interval to see if it helps.
I also tried tuning the memory usage, but it did not seem to help; the task manager still shows the same picture (image).
TL;DR: it's sometimes hard to analyse this kind of problem. I have two lucky guesses/shots. First, if you are using the RocksDB state backend, you could switch to FsStateBackend - it's usually faster, and RocksDB makes the most sense with large state sizes that do not fit into memory (or if you really need the incremental checkpointing feature). Second is to fiddle with parallelism, either increasing or decreasing.
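For reference, the state backend switch is a configuration change, not a code change. A sketch — the checkpoint directory here is a hypothetical example, not your actual path:

```yaml
# flink-conf.yaml -- sketch; checkpoint directory is a hypothetical example
state.backend: filesystem                 # selects the heap-based FsStateBackend in Flink 1.x
state.checkpoints.dir: s3://my-bucket/flink-checkpoints
```

Keep in mind that FsStateBackend holds working state on the JVM heap, so it is only viable if the state fits comfortably in memory.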
I would suspect the same thing that @ArvidHeise wrote. Your checkpoint size is not huge, but it's also not trivial. It can add enough extra overhead to push the job over the threshold from barely keeping up with the traffic to not keeping up and causing backpressure. If you are under backpressure, the latency will just keep accumulating, so even a change of a couple of percent of extra overhead can make the difference between end-to-end latencies of milliseconds vs. an unbounded, ever-growing value.
If you can not simply add more resources, you will have to analyse what exactly is adding this extra overhead and which resource is the bottleneck.
Is it CPU? Check CPU usage on the cluster. If it's ~100%, that's the thing you need to optimise for.
Is it IO? Check IO usage on the cluster and compare it against the maximal throughput/number of requests per second that you can achieve.
If both CPU and IO usage are low, you might want to try to increase parallelism, but...
Keep in mind data skew. The backpressure could be caused by a single task, and in that case it is harder to analyse the problem, as it will be a single bottlenecked thread (on either IO or CPU), not the whole machine.
After figuring out what resource is the bottleneck, next question would be why? It might be immediately obvious once you see it, or it might require digging in, like checking GC logs, attaching profiler etc.
Answering those questions could either give you information what you could try to optimise in your job or allow you to tweak configuration or could give us (Flink developers) an extra data point what we could try to optimise on the Flink side.
Any kind of checkpointing adds computation overhead. Most of the checkpointing happens asynchronously (as you have stated), but it still adds general I/O operations. These additional I/O requests may, for example, congest your access to external systems. Also, if you enable checkpointing, Flink needs to keep track of more information (new vs. already checkpointed).
Have you tried to add more resources to your job? Could you share your whole checkpointing configuration?