My job flow works like this:
Src[Kafka] -> Lookup with MySQL -> Deduplication (using Top-N on proc time) -> Upsert Kafka/MySQL
The job itself runs fine and data flows to Kafka and MySQL as expected, but it keeps failing on checkpoints (image attached).
PS: For the time being I have disabled checkpointing, but when I re-enable it with the same properties it fails again.
The checkpoint is failing because it is timing out. The typical cause of checkpoint timeouts is backpressure that prevents the checkpoint barriers from making sufficiently rapid progress across the execution graph. Another possibility is inadequate bandwidth or quota for writing to the checkpoint storage.
Some ideas:
increase the timeout (the default timeout is 10 minutes; yours has been reduced to 2 minutes); see the sketch after this list
enable unaligned checkpoints (this should lessen the impact of backpressure on checkpoint times); also shown in the sketch below
find the cause of the backpressure and alleviate it (the MySQL lookup is an obvious candidate)
examine the parallel subtasks for asymmetries in checkpoint sizes, alignment times, etc., which can indicate skew in the processing caused by hot keys, misaligned watermarks, or other issues
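For the first two ideas, here is a minimal sketch using the DataStream CheckpointConfig API (the interval and timeout values are placeholders; if your job is defined in SQL, the equivalent configuration keys are execution.checkpointing.timeout and execution.checkpointing.unaligned):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // checkpoint every 60 s (placeholder interval)
        env.enableCheckpointing(60_000L);

        // raise the timeout from the reduced 2 minutes back toward the 10-minute default
        env.getCheckpointConfig().setCheckpointTimeout(600_000L);

        // let checkpoint barriers overtake buffered in-flight records so that
        // backpressure delays them less
        env.getCheckpointConfig().enableUnalignedCheckpoints();

        // ... build the Kafka -> lookup -> dedup -> upsert pipeline here, then env.execute()
    }
}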
Related
I'm having trouble understanding why my Flink job's offset commits to Kafka are taking so long. I have a checkpoint interval of 1 s and the following warning appears. I'm currently using version 1.14.
Committing offsets to Kafka takes longer than the checkpoint interval. Skipping commit of previous offsets because newer complete checkpoint offsets are available. This does not compromise Flink's checkpoint integrity
For comparison, some Kafka Streams applications we have running show a commit latency of around 100 ms.
Can you point me in the right direction? Are there any metrics that I can look at?
I tried to find metrics that could help debug this.
Since Flink is continually committing offsets (and commits can overlap when one runs long), network blips and other external issues that cause a checkpoint to take longer can result in what you are seeing: a subsequent checkpoint completes before the previous commit has succeeded.
There are a handful of useful metrics related to checkpointing that you may want to explore that might help determine what's occurring:
lastCheckpointDuration - The time it took to complete the last checkpoint (in milliseconds).
lastCheckpointSize - The checkpointed size of the last checkpoint (in bytes); this metric can differ from lastCheckpointFullSize if incremental checkpoints or the changelog state backend are enabled.
Monitoring these as well as some of the other checkpointing metrics, along with task/job manager logs, might help you piece together a story for what caused the slower commit to take so long.
If you find that you are continually encountering this, consider adjusting the checkpointing configuration for the job to tolerate these longer durations.
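If you prefer to poll these numbers outside the Web UI, Flink's REST API exposes the same checkpoint statistics under /jobs/<job-id>/checkpoints. A small sketch of polling it (the JobManager address and job id below are placeholders for your cluster):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CheckpointStatsProbe {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the JobManager address and pass your job id as the first argument.
        String jobManager = "http://localhost:8081";
        String jobId = args.length > 0 ? args[0] : "<job-id>";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(jobManager + "/jobs/" + jobId + "/checkpoints"))
                .GET()
                .build();

        // The response JSON mirrors the Web UI's Checkpoints tab: latest completed
        // checkpoint duration and size, failure counts, per-subtask details, etc.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}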
I'm reading data from a Kafka topic which has lots of data. Once Flink starts reading, it starts up fine and then crashes after some time, when backpressure hits 100%, and then goes into an endless cycle of restarts.
My question is: shouldn't Flink's backpressure mechanism come into play and reduce consumption from the topic until the in-flight data is consumed or the backpressure drops, as stated in this article: https://www.ververica.com/blog/how-flink-handles-backpressure? Or do I need to set some config that I'm missing? Is there any other solution to avoid this restart cycle when backpressure increases?
I've tried these configs:
taskmanager.network.memory.buffer-debloat.enabled: true
taskmanager.network.memory.buffer-debloat.samples: 5
My modules.yaml has this config for the transport:
spec:
  functions: function_name
  urlPathTemplate: http://nginx:8888/function_name
  maxNumBatchRequests: 50
  transport:
    timeouts:
      call: 2 min
      connect: 2 min
      read: 2 min
      write: 2 min
You should look in the logs to determine what exactly is causing the crash and restart, but typically when backpressure is involved in a restart it's because a checkpoint timed out. You could increase the checkpoint timeout.
However, it's better if you can reduce/eliminate the backpressure.
One common cause of backpressure is not providing Flink with enough resources to keep up with the load. If this is happening regularly, you should consider scaling up the parallelism. Or it may be that the egress is under-provisioned.
I see you've already tried buffer debloating (which should help). You can also try enabling unaligned checkpoints.
See https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/checkpointing_under_backpressure/ for more information.
I am trying to build a real-time stream processing system with Flink, with S3 as the source and Elasticsearch as the sink.
I have tried out 4 checkpointing configurations in total:
Exactly_Once with Aligned Checkpoints.
Exactly_Once with unAligned Checkpoints.
At_Least_Once with max 1 concurrent Checkpoint.
At_Least_Once with max 2 concurrent Checkpoint.
Exactly_Once with unaligned checkpoints seems to have the least delay in publishing to the sink,
while the delay for the remaining three seems to be similar.
As per the docs, At_Least_Once should not block events from one stream during checkpointing when alignment is delayed.
Is this behaviour different for filesystem-based sources?
Details about the job:
We have another service that is writing files to S3 in real time.
The part files are closed every 1 minute.
The Flink job consumes from this S3 path using env.readFile in PROCESS_CONTINUOUSLY mode with a window size of 30 s.
We were expecting a max processing delay of 5 minutes, but:
case 2: we are observing 8-10 minutes of delay.
cases 1, 3, 4: a delay of 10-14 minutes.
We are running this job with 16 similar sources.
I can see that the checkpoint delay is due to backpressure from two of the sources, whose TPS is 180 and 90 respectively; their alignment delays are ~7 min and ~6 min.
However, resource consumption remains pretty stable during the entire period; memory spikes to at most 70% of the heap.
Ingesting from S3 in this fashion performs poorly and is expensive (because it's doing ListObjects for each iteration).
A better solution would be to use a custom SQS source (AFAIK there is no official one) using Amazon S3 Event Notification. Here is a sample implementation.
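This is not that sample, just a rough sketch of the idea, assuming the legacy SourceFunction API and the AWS SDK v2 SqsClient; the class and constructor below are made up for illustration, and a production source should tie message deletion to checkpoint completion to get at-least-once delivery:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

/** Emits the raw bodies of S3 event notifications delivered to an SQS queue. */
public class S3EventNotificationSource extends RichSourceFunction<String> {

    private final String queueUrl;
    private transient SqsClient sqs;
    private volatile boolean running = true;

    public S3EventNotificationSource(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    @Override
    public void open(Configuration parameters) {
        sqs = SqsClient.create();
    }

    @Override
    public void run(SourceContext<String> ctx) {
        while (running) {
            // Long-poll the queue for up to 10 notifications at a time.
            ReceiveMessageRequest request = ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)
                    .waitTimeSeconds(20)
                    .build();
            for (Message message : sqs.receiveMessage(request).messages()) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(message.body());
                }
                // Simplification: deleting immediately means a failure between collect()
                // and the next checkpoint can lose the notification; a production source
                // should delete only once the enclosing checkpoint has completed.
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(queueUrl)
                        .receiptHandle(message.receiptHandle())
                        .build());
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() {
        if (sqs != null) {
            sqs.close();
        }
    }
}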
We are experiencing a very difficult-to-observe problem with our Flink job.
The Job is reasonably simple, it:
Reads messages from Kinesis using the Flink Kinesis connector
Keys the messages and distributes them to ~30 different CEP operators, plus a couple of custom WindowFunctions
The messages emitted from the CEP/window operators are forwarded to a SinkFunction that writes messages to SQS
We are running Flink 1.10.1 on Fargate, using 2 containers with 4 vCPUs/8GB; we are using the RocksDB state backend with the following configuration:
state.backend: rocksdb
state.backend.async: true
state.backend.incremental: false
state.backend.rocksdb.localdir: /opt/flink/rocksdb
state.backend.rocksdb.ttl.compaction.filter.enabled: true
state.backend.rocksdb.files.open: 130048
The job runs with a parallelism of 8.
When the job starts from cold, it uses very little CPU and checkpoints complete in 2 sec. Over time, the checkpoint sizes increase but the times are still very reasonable couple of seconds:
During this time we can observe the CPU usage of our TaskManagers gently growing for some reason:
Eventually, the checkpoint time will start spiking to a few minutes, and then will just start repeatedly timing out (10 minutes). At this time:
Checkpoint size (when it does complete) is around 60MB
CPU usage is high, but not 100% (usually around 60-80%)
Looking at in-progress checkpoints, usually 95%+ of operators complete the checkpoint within 30 seconds, but a handful will just stick and never complete. The SQS sink is always among these, even though the SinkFunction is not rich and has no state.
Using the backpressure monitor on these operators reports a HIGH backpressure
Eventually this situation resolves in one of two ways:
Enough checkpoints fail to trigger the job to fail due to a failed checkpoint proportion threshold
The checkpoints eventually start succeeding, but never go back down to the 5-10s they take initially (when the state size is more like 30MB vs. 60MB)
We are really at a loss at how to debug this. Our state seems very small compared to the kind of state you see in some questions on here. Our volumes are also pretty low, we are very often under 100 records/sec.
We'd really appreciate any input on areas we could look into to debug this.
Thanks,
A few points:
It's not unusual for state to gradually grow over time. Perhaps your key space is growing, and you are keeping some state for each key. If you are relying on state TTL to expire stale state, perhaps it is not configured in a way that allows it clean up expired state as quickly as you would expect. It's also relatively easy to inadvertently create CEP patterns that need to keep some state for a very long time before certain possible matches can be ruled out.
A good next step would be to identify the cause of the backpressure. The most common cause is that a job doesn't have adequate resources. Most jobs gradually come to need more resources over time, as the number of users (for example) being managed rises. For example, you might need to increase the parallelism, or give the instances more memory, or increase the capacity of the sink(s) (or the speed of the network to the sink(s)), or give RocksDB faster disks.
Besides being inadequately provisioned, other causes of backpressure include
blocking i/o is being done in a user function
a large number of timers are firing simultaneously
event time skew between different sources is causing large amounts of state to be buffered
data skew (a hot key) is overwhelming one subtask or slot
lengthy GC pauses
contention for critical resources (e.g., using a NAS as the local disk for RocksDB)
Enabling RocksDB native metrics might provide some insight.
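For example, individual native metrics can be switched on in the Flink configuration (each one adds some overhead, so enable them selectively; the exact set of available keys depends on your Flink version):
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.num-running-compactions: true
state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true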
Add this property to your configuration:
state.backend.rocksdb.checkpoint.transfer.thread.num: {threadNumberAccordingYourProjectSize}
If you do not set this property, it defaults to 1.
Link: https://github.com/apache/flink/blob/master/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBOptions.java#L62
Hi everyone, please help me.
I wrote an Apache Flink streaming job which reads JSON messages from Apache Kafka (500-1000 messages per second), deserializes them into POJOs, and performs some operations (filter-keyBy-process-sink). I use the RocksDB state backend with exactly-once semantics. But I do not understand which checkpointing interval I should set.
On some forums people mostly suggest 1000 or 5000 ms.
I tried setting the interval to 10 ms, 100 ms, 500 ms, 1000 ms, and 5000 ms, and have not noticed any difference.
Two factors argue in favor of a reasonably small checkpoint interval:
(1) If you are using a sink that does two-phase transactional commits, such as Kafka or the StreamingFileSink, then those transactions will only be committed during checkpointing. Thus any downstream consumers of the output of your job will experience latency that is governed by the checkpoint interval.
Note that you will not experience this delay with Kafka unless you have taken all of the steps required to have exactly-once semantics, end-to-end. This means that you must set Semantic.EXACTLY_ONCE in the Kafka producer, and set the isolation.level in downstream consumers to read_committed. And if you are doing this, you should also increase transaction.max.timeout.ms beyond the default (which is 15 minutes). See the docs for more.
(2) If your job fails and needs to recover from a checkpoint, the inputs will be rewound to the offsets recorded in the checkpoint, and processing will resume from there. If the checkpoint interval is very long (e.g., 30 minutes), then your job may take quite a while to catch back up to the point where it is once again processing events in near real-time (assuming you are processing live data).
On the other hand, checkpointing does add some overhead, so doing it more often than necessary has an impact on performance.
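To make point (1) concrete, here is a minimal sketch of the settings involved; the values are placeholders, and the producer/consumer properties are passed to whichever Kafka connector version you are using:

import java.util.Properties;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOnceSettings {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Transactional sinks commit on checkpoints, so downstream latency is
        // roughly bounded by this interval (1 s here, as in the question above).
        env.enableCheckpointing(1_000L, CheckpointingMode.EXACTLY_ONCE);

        // Producer side: the transaction timeout must cover the checkpoint interval
        // plus recovery time, and must not exceed the broker's transaction.max.timeout.ms.
        Properties producerProps = new Properties();
        producerProps.setProperty("transaction.timeout.ms", "900000");

        // Downstream consumers only see committed data if they read committed offsets.
        Properties consumerProps = new Properties();
        consumerProps.setProperty("isolation.level", "read_committed");
    }
}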
In addition to the points described by David, I also suggest configuring the minimum pause between checkpoints via the StreamExecutionEnvironment's CheckpointConfig:
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds)
This way, you guarantee that your job will be able to make some progress in case the state gets bigger than planned or the storage where checkpoints are written is slow.
I recommend reading the Flink documentation on Tuning Checkpointing to better understand these scenarios.
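For completeness, a small sketch showing the min-pause setting in context (the values are placeholders):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinPauseExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.enableCheckpointing(5_000L);

        // Guarantee at least 2 s of normal processing between the end of one
        // checkpoint and the start of the next, even if checkpoints run long.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2_000L);

        // A non-zero min pause implies at most one checkpoint in flight;
        // setting it explicitly documents that intent.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
    }
}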