Bypass Flink CEP Buffer in an EventTime Streaming Environment - apache-flink

I have an EventTime streaming application that uses the CEP library for a basic three-step pattern on a joined stream. The joined stream is a combination of live, watermarked, and windowed data and a stream of historical items outside of the windowing/watermarking.
The setup is similar to the dataArtisans blog post except with the CEP Pattern as the last step.
Our CEP setup looks like this, and worked before adding in the non-timestamped historical stream. The EscalatingAlertEventIterativeCondition makes sure that the previous event match is of a greater level than the next.
    Pattern<AlertEvent, ?> pattern = Pattern.<AlertEvent>begin("one")
        .where((AlertEvent event) -> event.level > 0)
        .next("two")
        .where(new EscalatingAlertEventIterativeCondition("one"))
        .next("three")
        .where(new EscalatingAlertEventIterativeCondition("two"));

    return CEP.pattern(alertEventStream, pattern);
The problem I'm seeing is that CEP is forever buffering (breakpoints within the filter and iterative conditions are now not hit) and that the filtering/selection never happens. I initially thought this could be due to the CEP buffer but am unsure as I am new to both Flink and Flink CEP. Is there any way to avoid the lateness buffer, or does something else look amiss?
Our job graph, where only the top, live stream of data is timestamped and watermarked:

Related

Flink stream going to two sinks based on conditions

I am trying to find out whether a stream can be sent to two sinks based on conditions.
The requirement: the stream contains events, and after transformation all events need to go to one sink (assume one Kafka topic),
while only error events need to go to another sink (assume another Kafka topic).
I have not seen a use case where additional logic is applied at the sink once the transformation is done. Has anyone done something similar?
The best way to do this is with side outputs.
    // the anonymous subclass ({}) lets Flink capture the side output's element type
    private static final OutputTag<String> errors = new OutputTag<String>("errors") {};
    ...
    // in your main() method
    SingleOutputStreamOperator<T> result = events.process(new MyProcessFunction()); // MyProcessFunction: your own ProcessFunction subclass
    result.addSink(sink).name("normal output");
    result.getSideOutput(errors).addSink(errorSink).name("error output");
    ...
    // in the process function
    if (somethingGoesWrong) {
        ctx.output(errors, "error message");
    }
While there are other ways to split a stream with Flink, side outputs are very flexible (e.g., the side outputs can have different types) and perform well.

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the Flink DataStream API.
Our code using the Flink 1.12.x DataSet API:
    dataset
        .<few operations>
        .mapPartition(new SomeMapPartitionFn())
        .<few more operations>

    public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
        @Override
        public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
            for (InputModel record : records) {
                /*
                 * do some operation
                 */
                if (/* some condition based on processing *MULTIPLE* records */) {
                    out.collect(...); // Conditional collect ---> (1)
                }
            }
            // At the end of the data, collect
            out.collect(...); // Collect processed data ---> (2)
        }
    }
(1) - Collector.collect is invoked based on some condition after processing a few records
(2) - Collector.collect is invoked at the end of the data
Initially we thought of using flatMap instead of mapPartition, but the collector is not available in the close function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How can this be implemented with the Flink 1.14.x DataStream API? Please advise.
Note: Our application works only with a finite set of data (batch mode).
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
To implement similar functionality, you need to define a window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer far more flexibility.
Solution provided by Zhipeng:
One solution could be to use a StreamOperator that implements the BoundedOneInput interface. Example code can be found here [1].
[1] https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing list thread: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
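The shape of that approach (emit conditionally while records arrive, then flush once when the bounded input ends) can be sketched without any Flink dependency. The following is only a plain-Java model under stated assumptions: the BoundedOperator interface, the ThresholdSum operator, and the run driver are hypothetical stand-ins defined locally for illustration, not Flink types. In a real job, the per-record and end-of-input callbacks would come from Flink's operator interfaces, as in the linked DataStreamUtils code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java model of "collect conditionally while consuming, then flush at end of input". */
final class MapPartitionModel {

    /** Hypothetical local stand-in for an operator with a per-record and an end-of-input callback. */
    interface BoundedOperator<I, O> {
        void processElement(I value, Consumer<O> out);
        void endInput(Consumer<O> out);
    }

    /** Sums integers; emits the running sum whenever it reaches the threshold, and the remainder at the end. */
    static final class ThresholdSum implements BoundedOperator<Integer, Integer> {
        private final int threshold;
        private int running = 0;

        ThresholdSum(int threshold) {
            this.threshold = threshold;
        }

        @Override
        public void processElement(Integer value, Consumer<Integer> out) {
            running += value;
            if (running >= threshold) { // (1) conditional collect, based on several records
                out.accept(running);
                running = 0;
            }
        }

        @Override
        public void endInput(Consumer<Integer> out) {
            out.accept(running);        // (2) collect whatever is left at the end of the data
        }
    }

    /** Drives an operator over a finite input and gathers everything it emits. */
    static <I, O> List<O> run(BoundedOperator<I, O> op, List<I> input) {
        List<O> results = new ArrayList<>();
        for (I value : input) {
            op.processElement(value, results::add);
        }
        op.endInput(results::add);
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(new ThresholdSum(10), List.of(4, 7, 3, 9, 2))); // prints [11, 12, 2]
    }
}
```

The point of the model is that state kept in the operator instance survives between records, so the end-of-input callback can emit the final result that close() in a flatMap cannot.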

Apache Flink : Batch Mode failing for Datastream API's with exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`

A continuation to this : Flink : Handling Keyed Streams with data older than application watermark
Based on the suggestion, I have been trying to add batch support to the same Flink application, which uses the DataStream API.
The logic is something like this :
    streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
    streamExecutionEnvironment.readTextFile("fileName")
        .process(/* process function which transforms input */)
        .assignTimestampsAndWatermarks(WatermarkStrategy
            .<DetectionEvent>forBoundedOutOfOrderness(orderness)
            .withTimestampAssigner(
                (SerializableTimestampAssigner<Event>) (event, l) -> event.getEventTime()))
        .keyBy(keyFunction)
        .window(TumblingEventTimeWindows.of(Time.days(x))) // x = window size in days
        .process(processWindowFunction);
Based on the public docs, my understanding was that I simply needed to change the source to a bounded one. However, the above processing keeps failing at the event trigger after the windowing step with the exception below:
    java.lang.IllegalStateException: Checkpointing is not allowed with sorted inputs.
        at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
        at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:99)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:552)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
        at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
        at java.base/java.lang.Thread.run(Thread.java:829)
The input file contains the historical events for multiple keys. The data for a given key is sorted, but the overall data is not. I have also added an event at the end of each key with the timestamp = MAX_WATERMARK to indicate end of keyed Stream. I tried it for a single key as well but the processing failed with the same exception.
Note: I have not enabled checkpointing.
I have also tried explicitly disabling checkpointing to no avail.
env.getCheckpointConfig().disableCheckpointing();
EDIT - 1
Adding more details :
I tried changing and using FileSource to read files but still getting the same exception.
    environment.fromSource(
        FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
        WatermarkStrategy.noWatermarks(),
        "Text File")
The first process step and key splitting works. However it fails after that. I tried removing windowing and adding a simple process step but it continues to fail.
There is no explicit Sink. The last process function simply updates a database.
Is there something I'm missing ?
That exception can only be thrown if checkpointing is enabled. Perhaps you have a checkpointing interval configured in flink-conf.yaml?

Flink CEP Event Not triggering

I have implemented a CEP pattern in Flink that works as expected when connected to a local Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP pattern is not triggering.
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // saves checkpoint
    env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
I am using AscendingTimestampExtractor,
    consumer.assignTimestampsAndWatermarks(
        new AscendingTimestampExtractor<ObjectNode>() {
            @Override
            public long extractAscendingTimestamp(ObjectNode objectNode) {
                // parse the timestamp string under "value" into epoch milliseconds
                Instant instant = Instant.parse(objectNode.get("value").get("timestamp").asText());
                return instant.toEpochMilli();
            }
        });
I am also getting this warning message:
AscendingTimestampExtractor:140 - Timestamp monotony violated: 1594017872227 < 1594017873133
I also tried AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks, but neither of them works.
I have attached a Flink console screenshot showing that no watermark is being assigned.
Updated Flink console screenshot
Could anyone help?
CEP must first sort the input stream(s), which it does based on the watermarking. So the problem could be with watermarking, but you haven't shown us enough to debug the cause. One common issue is having an idle source, which can prevent the watermarks from advancing.
But there are other possible causes. To debug the situation, I suggest you look at some metrics, either in the Flink Web UI or in a metrics system if you have one connected. To begin, check if records are flowing, by looking at numRecordsIn, numRecordsOut, or numRecordsInPerSecond and numRecordsOutPerSecond at different stages of your pipeline.
If there are events, then look at currentOutputWatermark throughout the different tasks of your job to see if event time is advancing.
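To see why a stalled watermark makes CEP look as if it buffers forever, here is a plain-Java model of an event-time sort buffer. It is a sketch only: EventTimeBufferModel and its methods are hypothetical names defined here for illustration, not Flink or CEP classes. The model assumes what is described above, namely that events are held in timestamp order and handed to pattern matching only once the watermark has passed them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Plain-Java model of a buffer that releases events only when the watermark passes them. */
final class EventTimeBufferModel {

    static final class Event {
        final long timestamp;
        final String name;

        Event(long timestamp, String name) {
            this.timestamp = timestamp;
            this.name = name;
        }
    }

    // Events wait here, ordered by timestamp, until a watermark releases them.
    private final PriorityQueue<Event> buffer =
            new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.timestamp));

    /** On arrival an event is only buffered; no pattern condition is evaluated yet. */
    void onEvent(Event e) {
        buffer.add(e);
    }

    /** Releases, in timestamp order, every buffered event whose timestamp is <= the watermark. */
    List<String> onWatermark(long watermark) {
        List<String> released = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek().timestamp <= watermark) {
            released.add(buffer.poll().name);
        }
        return released;
    }

    int buffered() {
        return buffer.size();
    }

    public static void main(String[] args) {
        EventTimeBufferModel model = new EventTimeBufferModel();
        model.onEvent(new Event(30, "c"));
        model.onEvent(new Event(10, "a"));
        model.onEvent(new Event(20, "b"));
        System.out.println(model.buffered());      // 3: nothing is released without a watermark
        System.out.println(model.onWatermark(20)); // [a, b]: released in timestamp order
        System.out.println(model.buffered());      // 1: "c" still waits for a later watermark
    }
}
```

If onWatermark is never called with an advancing value, every event stays in the buffer, which matches the symptom of conditions never being hit.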
Update:
It appears you may be calling assignTimestampsAndWatermarks on the Kafka consumer, which will result in per-partition watermarking. In that case, if you have an idle partition, that partition won't produce any watermarks, and that will hold back the overall watermark. Try calling assignTimestampsAndWatermarks on the DataStream produced by the source instead, to see if that fixes things. (Of course, without per-partition watermarking, you won't be able to use an AscendingTimestampExtractor, since the stream won't be in order.)
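The effect of an idle partition can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not Flink code: WatermarkModel and combined are hypothetical names, and the model simply takes the overall watermark to be the minimum of the per-partition watermarks, with Long.MIN_VALUE standing for a partition that has not produced a watermark yet.

```java
import java.util.Arrays;

/** Plain-Java model of how per-partition watermarks combine into one overall watermark. */
final class WatermarkModel {

    /** Stand-in value for a partition that has not produced any watermark yet. */
    static final long NO_WATERMARK = Long.MIN_VALUE;

    /** The overall watermark is the minimum across all partitions, so the slowest one wins. */
    static long combined(long... partitionWatermarks) {
        return Arrays.stream(partitionWatermarks).min().orElse(NO_WATERMARK);
    }

    public static void main(String[] args) {
        // All three partitions are active: event time advances to the slowest partition.
        System.out.println(combined(1000L, 2000L, 1500L));        // 1000

        // One partition is idle: the overall watermark never moves past its initial value.
        System.out.println(combined(1000L, 2000L, NO_WATERMARK)); // -9223372036854775808
    }
}
```

With the second input, no event-time timer or CEP buffer downstream would ever fire, no matter how far the two active partitions progress.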

Apache Flink DataStream API doesn't have a mapPartition transformation

Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Could anyone help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming which is centered around mini batches. In Spark Streaming each mini batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini batches, but windows offer far more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
    yourStream.keyBy("myKey")              // organize stream by key "myKey"
        .timeWindow(Time.seconds(5))       // build 5 sec tumbling windows
        .reduce(new YourReduceFunction()); // apply a reduce function on each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the next days.
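What a keyed tumbling window with a reduce function computes can be modelled in plain Java on a finite list of records. The sketch below is an illustration only: TumblingReduceModel, Record, and reduceByKeyAndWindow are hypothetical names, timestamps are assumed to be non-negative milliseconds, and windows are assumed to be aligned to multiples of the window size. It does not model watermarks or triggering, only the grouping and reduction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

/** Plain-Java model of "key by, tumbling window, reduce" over a finite list of records. */
final class TumblingReduceModel {

    static final class Record {
        final String key;
        final long timestampMs;
        final int value;

        Record(String key, long timestampMs, int value) {
            this.key = key;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    /** Returns windowStart -> (key -> reduced value); assumes non-negative timestamps. */
    static Map<Long, Map<String, Integer>> reduceByKeyAndWindow(
            List<Record> records, long windowSizeMs, BinaryOperator<Integer> reducer) {
        Map<Long, Map<String, Integer>> result = new TreeMap<>();
        for (Record r : records) {
            // Each record belongs to exactly one tumbling window, identified by its start time.
            long windowStart = r.timestampMs - (r.timestampMs % windowSizeMs);
            result.computeIfAbsent(windowStart, w -> new TreeMap<>())
                  .merge(r.key, r.value, reducer);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Record> records = List.of(
                new Record("a", 1000, 1),
                new Record("b", 2000, 2),
                new Record("a", 4000, 3),
                new Record("a", 6000, 10));
        // 5-second tumbling windows, summing values per key
        System.out.println(reduceByKeyAndWindow(records, 5000, Integer::sum));
        // prints {0={a=4, b=2}, 5000={a=10}}
    }
}
```

Each window holds a finite set of records, which is why a reduce per key is well defined even though the stream itself is unbounded.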
Assuming your input stream is single-partition data (say, String):
    val new_number_of_partitions = 4
    // the line below partitions your data; you can also broadcast data to all partitions
    val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)
    // flexibility for mapping
    val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
        // var local_val_to_different_part : Type = null
        var myTaskId: Int = -1 // an Int cannot be null in Scala; -1 means "not set yet"
        // open() is executed once for each mapper instance (one mapper per partition)
        override def open(config: Configuration): Unit = {
            myTaskId = getRuntimeContext.getIndexOfThisSubtask
            // do whatever initialization you want to do, e.g. read from data sources
        }
        def map(value: String): (String, Int) = {
            (value, myTaskId)
        }
    })
    val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
    // Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
    // .countWindow will first wait for a certain number of records for a particular key
    // and then apply the function
Flink streaming is pure streaming (not the batched one). Take a look at Iterate API.

Resources