Flink Exponential Backoff On Just One Task

I have a Flink app with multiple tasks. In the event that one of those tasks has an error during processing, I'd like to do an exponential backoff on that one task without restarting the whole job. When using Kafka directly rather than through Flink, I can pause the consumer and then resume it later after a certain amount of time has passed. Is it possible to pause a Flink data source or task? Is there another way to accomplish an exponential backoff on just one task while not affecting the other tasks?

In general, Flink itself does not offer such a capability. That said, it may be possible to mimic it in some operators, such as specific sinks or AsyncIO. For Kafka, for example, you can configure the producer so that it retries failed messages and waits a given amount of time before each subsequent retry. That isn't exactly exponential backoff, but it is as close as you can get without writing your own sink.
So it generally depends on where you want the backoff; a non-exponential backoff may be available out of the box. As a last resort, you can write your own sink that implements exponential backoff.
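As a minimal sketch of those producer-side retry settings, assuming the newer KafkaSink API (Flink 1.14+); the broker address, topic, and retry values are placeholders, and retries / retry.backoff.ms are standard Kafka producer properties:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    public class RetryingProducerConfig {
        public static KafkaSink<String> sink() {
            // Plain Kafka producer retry settings: retry failed sends and wait a
            // fixed amount of time between attempts (fixed delay, not exponential).
            Properties producerProps = new Properties();
            producerProps.setProperty("retries", "10");
            producerProps.setProperty("retry.backoff.ms", "1000");

            return KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")        // placeholder
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("output-topic")           // placeholder
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    .setKafkaProducerConfig(producerProps)
                    .build();
        }
    }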

Related

flink consumes rabbitmq messages in parallel, how to ensure sequential consumption

I listen to the MySQL binlog through Flink, push the events into a RabbitMQ queue, and then consume the RabbitMQ messages in Flink. I set the parallelism to 1 so the messages are consumed sequentially, but this causes the Flink task to OOM. Is there any way to support a higher parallelism while still consuming in order? Please advise, thanks!
According to your description of the problem, it seems like you want to use multiple parallel sources and still process the events in a fixed order.
But it depends on what that order is.
You may want to check the concept of time semantics in Flink.
If you can define an event time for each event sent from the parallel sources, you can use event-time semantics together with assigned watermarks.
That way, when Flink receives the events, it knows to process them in event-time order regardless of the time at which it receives them (which is processing time).
Keywords: event time (which is the default) and processing time.
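As a hedged sketch of what assigning event time and watermarks can look like; BinlogEvent and its occurredAtMillis field are made up for illustration, and the out-of-orderness bound is arbitrary:

    import java.time.Duration;

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;

    public class EventTimeExample {
        // Hypothetical binlog event carrying its own event-time field.
        public static class BinlogEvent {
            public long occurredAtMillis;
        }

        public static DataStream<BinlogEvent> withEventTime(DataStream<BinlogEvent> events) {
            // Tell Flink which field carries event time and how much out-of-orderness
            // to tolerate; downstream event-time operators then see the events in
            // event-time order regardless of which parallel source read them.
            return events.assignTimestampsAndWatermarks(
                    WatermarkStrategy
                            .<BinlogEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                            .withTimestampAssigner((event, ts) -> event.occurredAtMillis));
        }
    }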

Flink message retries like Storm

I am trying to build a Flink job that reads data from a Kafka source, does a bunch of processing including a few REST calls, and finally sinks into another Kafka topic.
The problem I am trying to address is message retries. What if there are transient errors in the REST API? How can I do exponential-backoff-based retries of these messages, the way Storm supports?
I can think of two approaches:
Use TimerService, but then in case of failures the state will start to grow uncontrollably.
Write failed messages to a different Kafka topic and process them with a delay of sorts, but here a problem can arise if the sink itself is down for a few minutes.
Is there a better more robust and simpler way to achieve this?
I would use Flink's AsyncFunction to make the REST calls. If needed, it will backpressure the source(s) rather than use more than a configured amount of state. For retries, see AsyncFunction retries.
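For illustration, a minimal sketch of wiring a REST call through a RichAsyncFunction; callRestApi and the timeout/capacity values are placeholders, and the built-in retry variant mentioned in the comment is only available in newer Flink releases:

    import java.util.Collections;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

    public class EnrichViaRest extends RichAsyncFunction<String, String> {

        @Override
        public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
            // callRestApi(...) stands in for your non-blocking HTTP client call.
            CompletableFuture
                    .supplyAsync(() -> callRestApi(input))
                    .whenComplete((response, error) -> {
                        if (error != null) {
                            // Completing exceptionally fails the job and triggers restart/
                            // recovery; newer Flink versions also offer built-in retries
                            // via AsyncDataStream.unorderedWaitWithRetry(...).
                            resultFuture.completeExceptionally(error);
                        } else {
                            resultFuture.complete(Collections.singleton(response));
                        }
                    });
        }

        private String callRestApi(String input) {
            return input; // placeholder
        }

        public static DataStream<String> wire(DataStream<String> source) {
            // At most 100 in-flight requests; beyond that the source is backpressured,
            // which keeps the operator's state bounded.
            return AsyncDataStream.unorderedWait(
                    source, new EnrichViaRest(), 30, TimeUnit.SECONDS, 100);
        }
    }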

Is it possible to have a window per sub-task/partition

I am working with Flink using data from a Kafka topic that has multiple partitions. Is it possible to have a window on each parallel sub-task/partition without having to use keyBy (as I want to avoid the shuffle)? Based on the documentation, I can only choose between keyed windows (which require a shuffle) or global windows (which reduce parallelism to 1).
The motivation is that I want to use a CountWindow to batch the messages with a custom trigger that also fires after a set amount of processing time. So per Kafka partition, I want to batch N records together or wait X amount of processing time before sending the batch downstream.
Thanks!
There's no good way to do that.
One workaround would be to implement the batching and timeout logic in a custom sink. You'd want to implement the CheckpointedFunction interface to make your solution fault tolerant, and you could use the Sink.ProcessingTimeService.ProcessingTimeCallback interface for the timeouts.
UPDATE:
Just thought of another solution, similar to the one in your comment below. You could implement a custom source that sends a periodic heartbeat, and broadcast that to a BroadcastProcessFunction.
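As a minimal sketch of the batching half of the custom-sink workaround (the processing-time timeout wiring is omitted); BatchingSink, batchSize, and emitBatch are illustrative names, not Flink APIs:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    public class BatchingSink extends RichSinkFunction<String> implements CheckpointedFunction {

        private final int batchSize;
        private final List<String> pending = new ArrayList<>();
        private transient ListState<String> checkpointedPending;

        public BatchingSink(int batchSize) {
            this.batchSize = batchSize;
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            pending.add(value);
            if (pending.size() >= batchSize) {
                flush();
            }
            // A processing-time timeout could also call flush(); that wiring
            // (ProcessingTimeCallback) is omitted from this sketch.
        }

        private void flush() {
            emitBatch(new ArrayList<>(pending)); // placeholder for sending the batch downstream
            pending.clear();
        }

        private void emitBatch(List<String> batch) {
            // placeholder
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // Persist the not-yet-flushed records so they survive failure and restore.
            checkpointedPending.update(new ArrayList<>(pending));
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            checkpointedPending = context.getOperatorStateStore()
                    .getListState(new ListStateDescriptor<>("pending", String.class));
            if (context.isRestored()) {
                for (String record : checkpointedPending.get()) {
                    pending.add(record);
                }
            }
        }
    }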

Which checkpointing interval to set (ms)?

Hi everyone, please help me.
I am writing an Apache Flink streaming job that reads JSON messages from Apache Kafka (500-1000 messages per second), deserializes them into POJOs, and performs some operations (filter-keyBy-process-sink). I use the RocksDB state backend with exactly-once semantics. But I do not understand which checkpointing interval I should set.
People on some forums mostly suggest 1000 or 5000 ms.
I tried intervals of 10 ms, 100 ms, 500 ms, 1000 ms, and 5000 ms and have not noticed any difference.
Two factors argue in favor of a reasonably small checkpoint interval:
(1) If you are using a sink that does two-phase transactional commits, such as Kafka or the StreamingFileSink, then those transactions will only be committed during checkpointing. Thus any downstream consumers of the output of your job will experience latency that is governed by the checkpoint interval.
Note that you will not experience this delay with Kafka unless you have taken all of the steps required to have exactly-once semantics, end-to-end. This means that you must set Semantic.EXACTLY_ONCE in the Kafka producer, and set the isolation.level in downstream consumers to read_committed. And if you are doing this, you should also increase transaction.max.timeout.ms beyond the default (which is 15 minutes). See the docs for more.
(2) If your job fails and needs to recover from a checkpoint, the inputs will be rewound to the offsets recorded in the checkpoint, and processing will resume from there. If the checkpoint interval is very long (e.g., 30 minutes), then your job may take quite a while to catch back up to the point where it is once again processing events in near real-time (assuming you are processing live data).
On the other hand, checkpointing does add some overhead, so doing it more often than necessary has an impact on performance.
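To make point (1) concrete, here is a hedged sketch of the exactly-once settings described above, assuming the newer KafkaSink API (Flink 1.14+); the broker address, topic, transactional-id prefix, and timeout value are placeholders:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.DeliveryGuarantee;
    import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
    import org.apache.flink.connector.kafka.sink.KafkaSink;

    public class ExactlyOnceKafka {
        public static KafkaSink<String> sink() {
            return KafkaSink.<String>builder()
                    .setBootstrapServers("broker:9092")                  // placeholder
                    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                            .setTopic("output-topic")                     // placeholder
                            .setValueSerializationSchema(new SimpleStringSchema())
                            .build())
                    // Transactions are committed when checkpoints complete, so the
                    // checkpoint interval bounds the latency seen by consumers.
                    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                    .setTransactionalIdPrefix("my-job")                   // placeholder
                    // Must cover the checkpoint interval and stay within the broker's
                    // transaction.max.timeout.ms (raise the 15-minute broker default).
                    .setProperty("transaction.timeout.ms", "900000")
                    .build();
        }

        public static Properties downstreamConsumerProps() {
            // Downstream consumers must only read committed records.
            Properties props = new Properties();
            props.setProperty("isolation.level", "read_committed");
            return props;
        }
    }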
In addition to the points described by David, my suggestion is also to use the following method to configure the checkpointing:
StreamExecutionEnvironment.getCheckpointConfig().setMinPauseBetweenCheckpoints(milliseconds)
This way, you guarantee that your job will be able to make some progress even if the state gets bigger than planned or the storage where the checkpoints are written is slow.
I recommend reading the Flink documentation on Tuning Checkpointing to better understand these scenarios.
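A small sketch of how those pieces fit together, assuming the RocksDB state backend from the question; the interval and pause values are purely illustrative, not recommendations:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigExample {
        public static StreamExecutionEnvironment configure() {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // RocksDB state backend, as in the question (requires the
            // flink-statebackend-rocksdb dependency).
            env.setStateBackend(new EmbeddedRocksDBStateBackend());

            // Checkpoint every 5 s with exactly-once semantics (illustrative value).
            env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);

            // Guarantee at least 2 s of progress between the end of one checkpoint
            // and the start of the next, even if checkpoints become slow or large.
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(2000);

            // Only one checkpoint in flight at a time.
            env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

            return env;
        }
    }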

Flink, basic rule for checkpointing?

I have 2 questions regarding Flink checkpointing strategy,
I know that checkpointing is related to state (right?), so if I'm not using state (ValueState and that sort of thing) explicitly in my job code, do I need to care about checkpointing? Is it still necessary?
If I do need to enable checkpointing, what should the interval be? Are there any basic rules for setting it? Suppose we're talking about a quite busy system (Kafka + Flink), with several billion messages per day.
Many thanks.
Even if you are not using state explicitly in your application, Flink's Kafka source and sink connectors use state on your behalf in order to provide you with either at-least-once or exactly-once guarantees, assuming you care about those guarantees. Some other operators, such as windows and other streaming aggregations, also use state somewhat transparently on your behalf.
If your Flink job fails, then it will be rewound back to the most recent successful checkpoint, and resume processing from there. So, for example, if your checkpoint interval is 10 minutes, then after recovery your job might have 10+ minutes of data to catch up on before it can resume processing live data. So choose a checkpoint interval that you can live with from this perspective.
