How to handle transient/application failures in Apache Flink? - apache-flink

My Flink processor listens to Kafka and the business logic in processor involves calling external REST services and there are possibilities that the services may be down. I would like to replay the tuple back into the processor and Is there anyway to do it? I have used Storm and we will be able to fail the tuple so that the the tuple will not be acknowledged. So the same tuple will be replayed to the processor.
In Flink, the tuple is being acknowledged automatically once the message is consumed by Flink-Kafka Consumer. There are ways to solve this. One such way is to publish the message back to the same queue/retry queue. But I am looking for a solution similar to Storm.
I know that Flink's Savepoint/Checkpoint will be used for fault tolerance. But in my understanding, the tuples will be replayed win case of the Flink's failure. I would like to get ideas on how to handle transient failures.
Thank you

When interacting with external systems I would recommend to use Flink's async I/O operator. It allows you to execute asynchronous tasks without blocking the execution of an operator.
If you want to retry failed operations without restarting the Flink job from the last successful checkpoint, then I would suggest to implement the retry policy yourself. It could look the following way:
new AsyncFunction<IN, OUT>() {
#Override
public void asyncInvoke(IN input, ResultFuture<OUT> resultFuture) throws Exception {
FutureUtils
.retrySuccessfulWithDelay(
() -> triggerAsyncOperation(input),
Time.seconds(1L),
Deadline.fromNow(Duration.ofSeconds(10L)),
this::decideWhetherToRetry,
new ScheduledExecutorServiceAdapter(new DirectScheduledExecutorService()))
.whenComplete((result, throwable) -> {
if (result != null) {
resultFuture.complete(Collections.singleton(result));
} else {
resultFuture.completeExceptionally(throwable);
}
})
}
}
with triggerAsyncOperation encapsulating your asynchronous operation and decideWhetherToRetry encapsulating your retry strategy. If decideWhetherToRetry returns true, then resultFuture will be completed with the value of this operation attempt.
If resultFuture is completed exceptionally, then it will trigger a failover which will cause the job to restart from that last successful checkpoint.

Related

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from Flink 1.12.x DataSet api to Flink 1.14.x DataStream api. mapPartition is not available in Flink DataStream.
Our Code using Flink 1.12.x DataSet
dataset
.<few operations>
.mapPartition(new SomeMapParitionFn())
.<few more operations>
public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
#Override
public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
for (InputModel record : records) {
/*
do some operation
*/
if (/* some condition based on processing *MULTIPLE* records */) {
out.collect(...); // Conditional collect ---> (1)
}
}
// At the end of the data, collect
out.collect(...); // Collect processed data ---> (2)
}
}
(1) - Collector.collect invoked based on some condition after processing few records
(2) - Collector.collect invoked at the end of data
Initially we thought of using flatMap instead of mapPartition, but collector not available in close function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How to implement this in Flink 1.14.x DataStream? Please advise...
Note: Our application works with only finite set of data (Batch Mode)
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement similar function, you need to define a window over the stream. Windows discretize streams which is somewhat similar to mini batches but windows offer way more flexibility
Solution provided by Zhipeng
One solution could be using a streamOperator to implement BoundedOneInput
interface.
An example code could be found here [1].
[1]
https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing link: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z

Apache Flink : Batch Mode failing for Datastream API's with exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`

A continuation to this : Flink : Handling Keyed Streams with data older than application watermark
based on the suggestion, I have been trying to add support for Batch in the same Flink application which was using the Datastream API's.
The logic is something like this :
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
streamExecutionEnvironment.readTextFile("fileName")
.process(process function which transforms input)
.assignTimestampsAndWatermarks(WatermarkStrategy
.<DetectionEvent>forBoundedOutOfOrderness(orderness)
.withTimestampAssigner(
(SerializableTimestampAssigner<Event>) (event, l) -> event.getEventTime()))
.keyBy(keyFunction)
.window(TumblingEventWindows(Time.of(x days))
.process(processWindowFunction);
Based on the public docs, my understanding was that i simply needed to change the source to a bounded one. However the above processing keeps on failing at the event trigger after the windowing step with the below exception :
java.lang.IllegalStateException: Checkpointing is not allowed with sorted inputs.
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:552)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
The input file contains the historical events for multiple keys. The data for a given key is sorted, but the overall data is not. I have also added an event at the end of each key with the timestamp = MAX_WATERMARK to indicate end of keyed Stream. I tried it for a single key as well but the processing failed with the same exception.
Note: I have not enabled checkpointing.
I have also tried explicitly disabling checkpointing to no avail.
env.getCheckpointConfig().disableCheckpointing();
EDIT - 1
Adding more details :
I tried changing and using FileSource to read files but still getting the same exception.
environment.fromSource(FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
WatermarkStrategy.noWatermarks(),
"Text File")
The first process step and key splitting works. However it fails after that. I tried removing windowing and adding a simple process step but it continues to fail.
There is no explicit Sink. The last process function simply updates a database.
Is there something I'm missing ?
That exception can only be thrown if checkpointing is enabled. Perhaps you can a checkpointing interval configured in flink-conf.yaml?

Flink CEP Event Not triggering

I have implement the CEP Pattern in Flink which is working as expected connecting to local Kafka broker. But when i connecting to cluster based cloud kafka setup, the Flink CEP is not triggering.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//saves checkpoint
env.getCheckpointConfig().enableExternalizedCheckpoints(
CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using AscendingTimestampExtractor,
consumer.assignTimestampsAndWatermarks(
new AscendingTimestampExtractor<ObjectNode>() {
#Override
public long extractAscendingTimestamp(ObjectNode objectNode) {
long timestamp;
Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
timestamp = instant.toEpochMilli();
return timestamp;
}
});
And also i am getting Warn Message that,
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
And Also i tried using AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks none of one is working
I have attached Flink console screenshot where Watermark is not assigning.
Updated flink console screenshot
Could Anyone Help?
CEP must first sort the input stream(s), which it does based on the watermarking. So
the problem could be with watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which will result in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)

Non blocking streaming on Flink

Hi, I'm trying to run a Flink job that it should process incoming data as below. In the process operator right after keyBy(), there should be a case that takes too much time according to some property in data. Even though incoming data have different ids (which is used to keyBy() the stream), long processing code in process function blocks other incoming data. I mean the entire stream.
SingleOutputStreamOperator<Envelope> processingStream = deviceStream
.map(e -> (Envelope) e)
.keyBy((KeySelector<Envelope, String>) value -> value.eventId) // key by scenarios
.process(new RuleProcessFunction());
In RuleProcessFunction.java:
...
#Override
public void processElement(Envelope value, Context ctx, Collector<Envelope> out) throws Exception {
//handleEvent(value, ctx, out);
if (value.getEventId().equals("I")) {
System.out.println("hello i");
for (long i = 0; i < 10000000000L; i++) {
}
}
out.collect(value);
}
I expect the long-running code block should not block the entire stream. I know there is AsyncFunction for blocking IO situations but I don't know that it's correct solution for this.
Since you aren't pulling data from an external database like Cassandra, I don't think you need to use an AsyncFunction.
What it could be that you are running the flink job with a single parallelism. Try increasing the parallelism so one core isn't responsible for all of the processing as well as receiving data. Granted, there can still be back pressure if you do this. Since if the core responsible for ingesting data from the source is reading in data faster than the core(s) that are running the processFunction Flink's back pressure handling will slow the rate of ingestion.

Job Queue using Google PubSub

I want to have a simple task queue. There will be multiple consumers running on different machines, but I only want each task to be consumed once.
If I have multiple subscribers taking messages from a topic using the same subscription ID is there a chance that the message will be read twice?
I've tested something along these lines successfully but I'm concerned that there could be synchronization issues.
client = SubscriberClient.create(SubscriberSettings.defaultBuilder().build());
subName = SubscriptionName.create(projectId, "Queue");
client.createSubscription(subName, topicName, PushConfig.getDefaultInstance(), 0);
Thread subscriber = new Thread() {
public void run() {
while (!interrupted()) {
PullResponse response = subscriberClient.pull(subscriptionName, false, 1);
List<ReceivedMessage> messages = response.getReceivedMessagesList();
mess = messasges.get(0);
client.acknowledge(subscriptionName, ImmutableList.of(mess.getAckId()));
doSomethingWith(mess.getMessage().getData().toStringUtf8());
}
}
};
subscriber.start();
In short, yes there is a chance that some messages will be duplicated: GCP promises at-least-once delivery. Exactly-once-delivery is theoretically impossible in any distributed system. You should design your doSomethingWith code to be idempotent if possible so duplicate messages are not a problem.
You should also only acknowledge a message once you have finished processing it: what would happen if your machine dies after acknowledge but before doSomethingWith returns? your message will be lost! (this fundamental idea is why exactly-once delivery is impossible).
If losing messages is preferable to double processing them, you could add a locking process (write a "processed" token to a consistent database), but this can fail if the write is handled before the message is processed. But at this point you might be able to find a messaging technology that is designed for at-most-once, rather than optimised for reliability.

Resources