I'm trying to implement a CEP pattern in Flink on an out-of-order stream of events.
My stream is built this way:
DataStream<DataInput> input = inputStream.flatMap(
        new FlatMapFunction<String, DataInput>() {
            @Override
            public void flatMap(String value, Collector<DataInput> out) throws Exception {
                for (DataInput input : JsonUtilsJackson.getInstance().initTrackingDataFromJson(value)) {
                    // One input can generate multiple DataInput
                    out.collect(input);
                }
            }
        })
        // Elements can arrive late
        .assignTimestampsAndWatermarks(WatermarkStrategy.<DataInput>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                // Timestamp is not based on Kinesis but on the data timestamp
                .withTimestampAssigner((event, timestamp) -> event.getGeneratedDate().toEpochSecond()))
        // CEP by key
        .keyBy(requestId -> requestId.getTrackingData().getEntityReference());
And my pattern is linked to my stream by the code below:
SingleOutputStreamOperator<DataOutput> enterStream = CEP.pattern(
input,
PatternStrategy.getPattern()
).process(new SpecificProcess());
My understanding of forBoundedOutOfOrderness is that if an element is ingested at 11:01:00 with generatedDate = 10:00:00, Flink will accept all elements with a generatedDate between 09:59:50 and 10:00:00 and sort them in ascending order.
What I don't understand is how the periodic watermark check is managed. Since it does not depend on my Kinesis read timestamp (11:01:00 in my example), how does Flink decide that it no longer has to wait? Is that linked to periodic watermark generation plus the out-of-orderness bound?
During my tests, the pattern is triggered only once and never again.
By debugging, I see in CepOperator.onEventTime that events are correctly buffered, but their timestamp is always <= timerService.currentWatermark().
So, if someone has an explanation, it would help me. Thanks.
By the way, is there a way to have a watermark per key on a KeyedStream? My different entities do not have the same lifetime, and I miss some events.
Your question isn't entirely clear, but perhaps the information below will help you.
The role that watermarks play is that they sit at a particular spot in the stream, and mark that spot with a timestamp that indicates completeness -- at that spot in the stream, no further events are expected with timestamps less than the one in the watermark.
Watermarks don't sort the stream, but they can be used for sorting. This is what CEP does when it is used in event time mode.
forBoundedOutOfOrderness is a watermark strategy that produces watermarks periodically (by default, every 200 msec). But the watermark will only advance if there have been new events since the last watermark that can be used as justification for a larger watermark (i.e., at least one event with a larger timestamp).
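To make that concrete, here is a minimal sketch of the idea in plain Java. It is a standalone model, not Flink's own class: the name BoundedOutOfOrdernessModel and its two methods are made up for illustration, and it assumes the watermark is computed as the largest timestamp seen so far, minus the bound, minus one millisecond.

```java
// Standalone model of a bounded-out-of-orderness watermark generator.
// It only mirrors the idea; it does not use or implement any Flink class.
class BoundedOutOfOrdernessModel {
    private final long boundMillis;
    private long maxTimestamp;

    BoundedOutOfOrdernessModel(long boundMillis) {
        this.boundMillis = boundMillis;
        // Start low enough that the subtraction in onPeriodicEmit cannot underflow.
        this.maxTimestamp = Long.MIN_VALUE + boundMillis + 1;
    }

    // Called for every event: remember only the largest timestamp seen so far.
    void onEvent(long eventTimestampMillis) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMillis);
    }

    // Called on every periodic tick, whether or not events arrived in between.
    long onPeriodicEmit() {
        return maxTimestamp - boundMillis - 1;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessModel gen = new BoundedOutOfOrdernessModel(10_000);
        gen.onEvent(100_000);
        System.out.println(gen.onPeriodicEmit()); // 89999
        System.out.println(gen.onPeriodicEmit()); // 89999 again: no new events
        gen.onEvent(95_000);                      // out-of-order event, smaller timestamp
        System.out.println(gen.onPeriodicEmit()); // still 89999
        gen.onEvent(130_000);
        System.out.println(gen.onPeriodicEmit()); // 119999
    }
}
```

Note that the wall-clock time at which an event is read never enters the computation. Only a new event with a larger timestamp moves the watermark forward.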
Flink does not support per-key watermarking. But the FlinkKinesisConsumer supports per-shard watermarking, which may help. This will cause the shards with the most lag to hold back the watermarks, and this will avoid there being so many late events. And if you use a separate shard for each key, then you will have something similar to per-key watermarking.
The example is very useful at first; it illustrates how KeyedProcessFunction works in Flink.
There is something worth noticing that suddenly occurred to me.
It is from the "Fraud Detector v2: State + Time" part.
Setting a timer here is reasonable, given the real application requirements:
override def onTimer(
    timestamp: Long,
    ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
    out: Collector[Alert]): Unit = {
  // remove flag after 1 minute
  timerState.clear()
  flagState.clear()
}
Here is the problem:
The TimeCharacteristic is ProcessingTime, which is determined by the system clock of the running machine. Under processing time, the watermark will not change over time, so that means onTimer will never be called unless the TimeCharacteristic is changed to EventTime.
According to the Flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function be triggered? The condition for a window to be triggered is that the watermark reaches the end time of the window.
I wonder whether, in processing time, triggering a window doesn't depend on the watermark at all and is instead based on the processing time, even though the official website doesn't mention that.
Hope someone can spend a little time on this. Many thanks!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
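As a rough illustration of the difference, here is a standalone model in plain Java. TimerModel and its methods are hypothetical names and are not Flink's TimerService. The sketch only assumes what is described above: event time timers fire when a watermark reaches their timestamp, and processing time timers fire when the clock does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Standalone model of the two timer kinds; not Flink's TimerService.
class TimerModel {
    private final PriorityQueue<Long> eventTimeTimers = new PriorityQueue<>();
    private final PriorityQueue<Long> processingTimeTimers = new PriorityQueue<>();

    void registerEventTimeTimer(long timestamp) {
        eventTimeTimers.add(timestamp);
    }

    void registerProcessingTimeTimer(long timestamp) {
        processingTimeTimers.add(timestamp);
    }

    // A watermark fires every event time timer with timestamp <= watermark.
    List<Long> onWatermark(long watermark) {
        return drain(eventTimeTimers, watermark);
    }

    // The system clock fires every processing time timer with timestamp <= now.
    List<Long> onClock(long now) {
        return drain(processingTimeTimers, now);
    }

    private static List<Long> drain(PriorityQueue<Long> timers, long upTo) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.peek() <= upTo) {
            fired.add(timers.poll());
        }
        return fired;
    }
}
```

In this model, advancing the clock never touches the event time queue, and a watermark never touches the processing time queue.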
Watermarks are only relevant when doing event time processing, and the only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in this DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly, or indirectly (by using event time windows, or through one of the higher level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to match a pattern to a sequence of events within a specific timeframe -- so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (i.e., 10:00am to 10:01am).
Use case: using EventTime with timestamps extracted from Kafka records.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
.keyBy("platform")
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(AggFunc(), WindowFunc())
.countWindowAll(size)
.apply(someFunc)
.addSink(someSink);
What I want: Flink extracts the timestamp and emits a watermark for each record during an initial interval (e.g. 20 seconds), and after that it emits watermarks periodically (e.g. every 10 seconds).
Reason: if I use periodic watermarks, at the beginning Flink emits a watermark only after some interval, and the count in my first 5-minute window is wrong: it is much larger than the count in the subsequent windows. As a workaround I set setAutoWatermarkInterval to 100ms, but this is more than necessary.
Currently, I must use either AssignerWithPeriodicWatermarks or AssignerWithPunctuatedWatermarks. How can I implement a strategy that combines the two? Thanks.
Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
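The alignment can be sketched with a little arithmetic. This is a simplified model in plain Java, assuming a zero window offset and timestamps in milliseconds since the epoch; WindowAlignmentModel and windowStart are illustrative names, not Flink API.

```java
// Simplified model of epoch-aligned tumbling windows (zero offset assumed).
class WindowAlignmentModel {
    // Start of the tumbling window that contains timestampMillis.
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    public static void main(String[] args) {
        long fiveMinutes = 5 * 60 * 1000L;
        // First event at 09:17:30, expressed as milliseconds since midnight.
        long firstEvent = ((9 * 60 + 17) * 60 + 30) * 1000L;
        long start = windowStart(firstEvent, fiveMinutes);
        long end = start + fiveMinutes;
        System.out.println(start);            // 33300000, i.e. 09:15:00
        System.out.println(end - firstEvent); // 150000, only 2.5 minutes of data
    }
}
```

With the first event at 09:17:30, the first window is 09:15:00 to 09:20:00, so it can only contain events from the last two and a half minutes of that range.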
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-Flink) state in the assigner to keep track of either the time of the first event, or to count events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Your application will always revert back to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of periodic and punctuated assigners; it is simply an adaptive punctuated assigner.
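Here is a minimal sketch of that back-off logic as a standalone Java class. It does not implement Flink's interface, so it runs without Flink on the classpath; the class name and the warmupMillis and intervalMillis parameters are made up for illustration. It also assumes, for brevity, that the watermark equals the event timestamp, so a real assigner would still need to subtract an out-of-orderness bound.

```java
// Standalone model of an adaptive punctuated assigner's decision logic.
// Returns a watermark timestamp, or null when no watermark should be emitted.
class AdaptivePunctuatedModel {
    private final long warmupMillis;   // emit for every event during this span
    private final long intervalMillis; // afterwards, emit at most this often
    private boolean seenFirst = false; // plain field, not Flink-managed state
    private long firstTimestamp;
    private boolean emittedAny = false;
    private long lastEmitted;

    AdaptivePunctuatedModel(long warmupMillis, long intervalMillis) {
        this.warmupMillis = warmupMillis;
        this.intervalMillis = intervalMillis;
    }

    Long checkAndGetNextWatermark(long extractedTimestamp) {
        if (!seenFirst) {
            seenFirst = true;
            firstTimestamp = extractedTimestamp;
        }
        // Watermarks must never go backwards.
        if (emittedAny && extractedTimestamp <= lastEmitted) {
            return null;
        }
        boolean warmingUp = extractedTimestamp - firstTimestamp < warmupMillis;
        boolean intervalElapsed =
                !emittedAny || extractedTimestamp - lastEmitted >= intervalMillis;
        if (warmingUp || intervalElapsed) {
            emittedAny = true;
            lastEmitted = extractedTimestamp;
            return extractedTimestamp;
        }
        return null; // back off: no watermark for this event
    }
}
```

Because the fields are ordinary instance fields, they are lost on restart, which is why the warm-up phase begins again after every restart.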
I need to do a sliding count over a large volume of messages using state and the TimerService. The slide is one and the window size is larger than 10 hours. The problem I have is that checkpointing takes a lot of time. To improve performance we use incremental checkpoints, but it is still slow when the system takes a checkpoint. We found that most of the time is spent serializing the timers that are used to clean up data. We have a timer for each key, and there are about 300M timers in total.
Any suggestion for solving this problem would be appreciated. Or can we do the count another way?
————————————————————————————————————————————
I'd like to add some details about the situation. The slide is one event and the window size is more than 10 hours (there are about 300 events per second), and we need to react to each event. So in this situation we did not use the windows provided by Flink; we use keyed state to store the previous information instead. The timers are used in a ProcessFunction to trigger the cleanup of old data. Finally, the number of distinct keys is very large.
I think this should work:
Dramatically reduce the number of keys Flink is working with from 300M down to 100K (for example), by effectively doing something like keyBy(key mod 100000). Your ProcessFunction can then use a MapState (where the keys are the original keys) to store whatever it needs.
MapStates have iterators, which you can use to periodically crawl each of these maps to expire old items. Stick to the principle of having only one timer per key (per uberkey, if you will), so that you only have 100K timers.
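A minimal sketch of the bucketing idea in plain Java, with an ordinary HashMap standing in for Flink's MapState. BucketedStateModel and its methods are hypothetical names, and the per-key payload is reduced to a single last-seen timestamp to keep the example short.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Standalone model: many original keys share a small number of buckets,
// and each bucket is crawled periodically to expire old entries.
class BucketedStateModel {
    private final int numBuckets;
    // bucket -> (original key -> last-seen timestamp); stands in for MapState
    private final Map<Integer, Map<String, Long>> buckets = new HashMap<>();

    BucketedStateModel(int numBuckets) {
        this.numBuckets = numBuckets;
    }

    // The "uberkey": what the stream would be keyed by instead of the original key.
    int bucketOf(String originalKey) {
        return Math.floorMod(originalKey.hashCode(), numBuckets);
    }

    void record(String originalKey, long timestamp) {
        buckets.computeIfAbsent(bucketOf(originalKey), b -> new HashMap<>())
               .put(originalKey, timestamp);
    }

    boolean contains(String originalKey) {
        Map<String, Long> bucket = buckets.get(bucketOf(originalKey));
        return bucket != null && bucket.containsKey(originalKey);
    }

    // What a single per-bucket timer would do: remove entries older than cutoff.
    int expire(int bucket, long cutoff) {
        Map<String, Long> entries = buckets.get(bucket);
        if (entries == null) {
            return 0;
        }
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it = entries.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoff) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }
}
```

The number of timers is then bounded by numBuckets rather than by the number of original keys, at the cost of scanning a whole bucket on each cleanup.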
UPDATE:
Flink 1.6 included FLINK-9485, which allows timers to be checkpointed asynchronously, and to be stored in RocksDB. This makes it much more practical for Flink applications to have large numbers of timers.
What if, instead of using timers, you add an extra field to every element of your stream to store the current processing time or the arrival time? Then, when you want to clean old data from your stream, you just have to use a filter operator and check whether the data is old enough to be deleted.
Rather than registering a clearing timer on each event, how about registering a timer only once per period, e.g. once per minute? You could register it only the first time a key is seen, and refresh it in onTimer. Something like:
new ProcessFunction<SongEvent, Object>() {
    ...
    @Override
    public void processElement(
            SongEvent songEvent,
            Context context,
            Collector<Object> collector) throws Exception {
        Boolean isTimerRegistered = state.value();
        // state.value() may be null the first time a key is seen; treat null as "not registered"
        if (isTimerRegistered == null || !isTimerRegistered) {
            context.timerService().registerProcessingTimeTimer(time);
            state.update(true);
        }
        // Standard processing
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Object> out)
            throws Exception {
        pruneElements(timestamp);
        if (!elements.isEmpty()) {
            // Data remains: refresh the timer for the next period
            ctx.timerService().registerProcessingTimeTimer(time);
        } else {
            state.clear();
        }
    }
}
Something similar is implemented for Flink SQL Over clause. You can have a look here
It would be helpful if someone could give a use case example that explains the difference between the Apache Flink watermark APIs listed below:
Periodic watermarks - AssignerWithPeriodicWatermarks[T]
Punctuated Watermarks - AssignerWithPunctuatedWatermarks[T]
The main difference between the two types of watermark is how and when the method that produces the watermark is called.
periodic watermark
With periodic watermarks, Flink calls getCurrentWatermark() at a regular interval, independently of the stream of events. This interval is defined using
ExecutionConfig.setAutoWatermarkInterval(millis)
Use this class when your watermarks depend (even partially) on the processing time, or when you need watermarks to be emitted even when no events have been received for a while.
punctuated watermarks
With punctuated watermarks, Flink calls checkAndGetNextWatermark() on each new event, i.e. right after calling extractTimestamp(). An actual watermark is emitted only if checkAndGetNextWatermark() returns a non-null value which is greater than the last watermark.
This means that if you don't receive any new element for a while, no watermark can be emitted.
Use this class if certain special elements act as markers that signify event time progress, and when you want to emit watermarks specifically at certain events. For example, you could have flags in your incoming stream marking the end of a sequence.
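The contrast can be sketched in plain Java. This is a standalone model with made-up names (WatermarkStylesModel, Event, endOfSequence), not the Flink interfaces themselves: the periodic variant is asked for a watermark on every tick, while the punctuated variant decides per event and returns null unless the event is a marker.

```java
// Standalone model contrasting the two styles; not the Flink interfaces.
class WatermarkStylesModel {
    static final class Event {
        final long timestamp;
        final boolean endOfSequence; // hypothetical marker flag in the data

        Event(long timestamp, boolean endOfSequence) {
            this.timestamp = timestamp;
            this.endOfSequence = endOfSequence;
        }
    }

    // Periodic style: events only update internal state; a tick reads it.
    static final class Periodic {
        private long maxTimestamp = Long.MIN_VALUE;

        void onEvent(Event e) {
            maxTimestamp = Math.max(maxTimestamp, e.timestamp);
        }

        long onTick() {
            return maxTimestamp; // called at a fixed interval, events or not
        }
    }

    // Punctuated style: asked once per event; null means "no watermark now".
    static final class Punctuated {
        Long onEvent(Event e) {
            return e.endOfSequence ? Long.valueOf(e.timestamp) : null;
        }
    }
}
```

In the punctuated model there is no tick method at all, which is why no watermark can appear while the stream is idle.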
I am doing real-time streaming in Flink with Kafka as the message queue. I am applying a sliding event-time window of 120 seconds with a slide of 1 second. I am also inserting a watermark for each second of event time.
My concern is what happens if an element arrives late, after the watermark. In my case, Flink simply discards messages that arrive after their respective watermark. Does Flink provide any mechanism to handle such late messages, like maintaining a separate window? I have gone through the documentation, but it was not clear to me.
Apache Flink has a concept called allowed lateness for the windows to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows you to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
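The rule can be sketched as a small decision function. This is a simplified standalone model in plain Java with made-up names, not Flink's window operator; it assumes a window covers [start, end) so that its last valid timestamp is end - 1, and that all values are in milliseconds.

```java
// Standalone model of how an element is treated relative to one window.
class AllowedLatenessModel {
    enum Outcome { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    static Outcome classify(long windowEnd, long allowedLateness, long currentWatermark) {
        long maxTimestamp = windowEnd - 1; // last timestamp that belongs to the window
        if (currentWatermark < maxTimestamp) {
            return Outcome.ON_TIME; // the window has not fired yet
        }
        if (currentWatermark < maxTimestamp + allowedLateness) {
            return Outcome.LATE_BUT_ACCEPTED; // window state is still kept
        }
        return Outcome.DROPPED; // window state has already been cleaned up
    }

    public static void main(String[] args) {
        long windowEnd = 120_000; // a window covering [0, 120000)
        System.out.println(classify(windowEnd, 5_000, 100_000)); // ON_TIME
        System.out.println(classify(windowEnd, 5_000, 121_000)); // LATE_BUT_ACCEPTED
        System.out.println(classify(windowEnd, 5_000, 125_000)); // DROPPED
        System.out.println(classify(windowEnd, 0, 121_000));     // DROPPED
    }
}
```

With the default allowed lateness of 0 the middle case disappears, so every element that arrives after the window has fired is dropped.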
Another option is side outputs: in addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream, and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don't want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
Allowed lateness can result in multiple outputs: the window fires once when the watermark passes the end of the window, and then fires again with an updated aggregate for each late element.