Flink understanding late events vs watermark - apache-flink

Looking at this article, it says the following about watermarks:
We will now set the watermark as current time - 5 seconds, which tells Flink to expect messages
to be delayed by a maximum of 5 seconds. This is because each window will be evaluated only when the watermark passes through it.
Later in that post, it explains what happens when allowed lateness is set:
Flink will not discard a message unless it is past the window_end_time + allowed lateness.
Does setting allowed lateness actually delay the evaluation of the window?
So what is the actual difference between using a watermark and using allowed lateness? When should I use which?

The watermark delay sets a lower bound on how long Flink will wait for out-of-order events before triggering a window for the first time.
The allowed lateness determines for how much longer Flink will keep around a window's state. Any late events that arrive while the window's state is still available will trigger the window again, causing it to produce updated results.
Once the allowed lateness has expired, a window's state is purged, and any supremely late events are either dropped or sent to a side output.
If the downstream consumers of your window's output can deal with receiving updates for window results like this (e.g., the window is connected to a live dashboard), then it may make sense to set a relatively short watermark delay, and use allowed lateness liberally. On the other hand, if you can't take advantage of anything after the initial results, you'll want to make the watermark delay large enough to satisfy your requirements for accuracy/completeness, and set allowed lateness to zero.
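To make the trade-off above concrete, here is a minimal sketch that replays the same out-of-order events through two configurations: a long watermark delay with no allowed lateness, and a short watermark delay with generous allowed lateness. It is a toy model of a single window, not Flink code; the class name LatenessDemo, the 10-second window, and the event timestamps are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of ONE event-time tumbling window covering [0, 10000) ms. Not Flink API.
class LatenessDemo {
    static final long WINDOW_END = 10_000L;

    // Replays event timestamps in arrival order and returns what the window would emit.
    static List<String> simulate(long[] arrivals, long watermarkDelay, long allowedLateness) {
        List<String> output = new ArrayList<>();
        long lastTimestampInWindow = WINDOW_END - 1;
        long maxTimestamp = Long.MIN_VALUE;
        long watermark = Long.MIN_VALUE;
        int count = 0;
        boolean fired = false;
        for (long ts : arrivals) {
            if (ts < WINDOW_END) {
                if (watermark >= lastTimestampInWindow + allowedLateness) {
                    // The window's state has been purged: too late to be counted.
                    output.add("dropped(" + ts + ")");
                } else {
                    count++;
                    if (fired) {
                        // Late but within allowed lateness: the window fires again.
                        output.add("update(count=" + count + ")");
                    }
                }
            }
            // Assumed watermark strategy: largest timestamp seen so far, minus a fixed delay.
            maxTimestamp = Math.max(maxTimestamp, ts);
            watermark = maxTimestamp - watermarkDelay;
            if (!fired && watermark >= lastTimestampInWindow) {
                fired = true;
                output.add("first(count=" + count + ")");
            }
        }
        return output;
    }

    public static void main(String[] args) {
        long[] arrivals = {1000, 4000, 9000, 11000, 8000, 12000, 7000, 16000, 6000};
        // Long watermark delay, no allowed lateness: one complete result, produced late.
        System.out.println(simulate(arrivals, 5000, 0));
        // prints [first(count=5), dropped(6000)]
        // Short watermark delay, 4 s allowed lateness: early result followed by updates.
        System.out.println(simulate(arrivals, 1000, 4000));
        // prints [first(count=3), update(count=4), update(count=5), dropped(6000)]
    }
}
```

Both runs end up counting the same five events, but the second one emits its first result as soon as the event with timestamp 11000 arrives and then corrects it twice, which only helps if the consumer can handle updates.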

Related

Some questions related to Fraud detection demo from Flink DataStream API

The example is very useful; it illustrates how KeyedProcessFunction works in Flink.
However, something occurred to me that seems worth pointing out.
It concerns the "Fraud Detector v2: State + Time" part.
Setting a timer here is reasonable, given the application's requirements:
override def onTimer(
    timestamp: Long,
    ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
    out: Collector[Alert]): Unit = {
  // remove flag after 1 minute
  timerState.clear()
  flagState.clear()
}
Here is the problem:
The TimeCharacteristic is ProcessingTime, which is determined by the system clock of the machine running the job. As I understand processing time, the watermark does not advance over time, which would mean that onTimer is never called unless the TimeCharacteristic is changed to EventTime.
According to the Flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function be triggered? As far as I know, a window is triggered when the watermark reaches the window's end time.
I suspect that in processing time, whether a window is triggered does not depend on the watermark at all and is instead based on the processing time, even though the official website doesn't mention this.
I hope someone can spend a little time on this. Many thanks!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
Watermarks are only relevant when doing event time processing, and the only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in the DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly or indirectly (by using event time windows, or through one of the higher level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to match a pattern to a sequence of events within a specific timeframe, so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (e.g., 10:00am to 10:01am).
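The distinction between the two kinds of timers described in this answer can be sketched with a toy timer service. This is only a model of the behavior, not Flink's implementation: the class ToyTimerService and the methods onWatermark and onClock are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Toy timer service; a stand-in that only models WHEN each kind of timer fires.
class ToyTimerService {
    private final PriorityQueue<Long> eventTimeTimers = new PriorityQueue<>();
    private final PriorityQueue<Long> processingTimeTimers = new PriorityQueue<>();
    final List<String> fired = new ArrayList<>();

    void registerEventTimeTimer(long timestamp) {
        eventTimeTimers.add(timestamp);
    }

    void registerProcessingTimeTimer(long timestamp) {
        processingTimeTimers.add(timestamp);
    }

    // A watermark arrived: fire event time timers whose timestamp is <= the watermark.
    void onWatermark(long watermark) {
        while (!eventTimeTimers.isEmpty() && eventTimeTimers.peek() <= watermark) {
            fired.add("event-time@" + eventTimeTimers.poll());
        }
    }

    // The system clock advanced: fire processing time timers whose timestamp is <= now.
    void onClock(long now) {
        while (!processingTimeTimers.isEmpty() && processingTimeTimers.peek() <= now) {
            fired.add("processing-time@" + processingTimeTimers.poll());
        }
    }

    public static void main(String[] args) {
        ToyTimerService timers = new ToyTimerService();
        timers.registerProcessingTimeTimer(60_000L);
        timers.registerEventTimeTimer(60_000L);

        timers.onClock(61_000L); // the clock alone is enough; no watermark involved
        System.out.println(timers.fired); // prints [processing-time@60000]

        timers.onWatermark(60_000L); // only a watermark can fire the event time timer
        System.out.println(timers.fired); // prints [processing-time@60000, event-time@60000]
    }
}
```

In this model the processing time timer fires even though no watermark was ever produced, which is why the fraud detector's onTimer is still called.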

Apache flink - Early firing window implementation issue - duplicate elements received

I am having quite a hard time understanding Flink's windowing principles and would be very pleased if you could point me in the right direction.
My purpose is to count the number of recurring events in a time interval and generate alert events if the number of recurring events is greater than a threshold.
As I understand it, windowing is a perfect match for this scenario.
An additional requirement is to generate an early alert if the recurring event count in a window reaches 2 (i.e., the alert should be generated without waiting for the window to end).
I thought that an alert-generating process window function could be used to aggregate the windowed events, and a custom trigger could be used to emit early results from the window based on the recurring event count (before the watermark reaches the window’s end timestamp).
I am using event-time semantics and have some problems and questions regarding the custom trigger.
You can find the actual implementation in the gist: https://gist.github.com/simpleusr/7c56d4384f6fc9f0a61860a680bb5f36
I am using keyed state (encounteredElementsCountState) to keep track of the element count in the window.
Upon receiving the first element, I register an EventTimeTimer for the window end. This is supposed to trigger FIRE_AND_PURGE when the window closes, and it works as expected.
If the count exceeds the threshold, I try to trigger an early fire. This also seems to be successful; the process window function is called immediately after this firing.
The problem is that I had to insert the check below into the code without understanding the reason, because the previously collected elements were supplied to the onElement method again:
if (ctx.getCurrentWatermark() < 0) {
    logger.debug(String.format("onElement processing skipped for eventId : %s for watermark: %s ", element.getEventId(), ctx.getCurrentWatermark()));
    return TriggerResult.CONTINUE;
}
I could not figure out the reason. What I see is that when this happens, the watermark value (ctx.getCurrentWatermark()) is Long.MIN_VALUE, which led to the above check. How can this happen?
This check seems to avoid duplicate early event generation, but I do not know why this happens or whether this workaround is appropriate.
Could you please advise why the same elements are processed twice in the window?
Another question is about the keyed state usage. Does this implementation leak any state after the window is disposed? I am trying to clear all used state in the clear method of the trigger, but would that be enough?
Regards.
Each task has currentWatermark initialized to Long.MIN_VALUE, and this remains the local value of currentWatermark until larger watermarks have arrived from all of that task's input streams. Hopefully knowing that will help you better understand what's going on.
For what it's worth, often it's more straightforward to implement this kind of logic with a ProcessFunction than with the Window API.
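The initialization behavior described in this answer can be sketched as follows. This is a simplified model rather than Flink's actual code: the class ToyTaskWatermark is hypothetical, and it assumes the task's watermark is simply the minimum of the latest watermark received on each input.

```java
import java.util.Arrays;

// Toy model of how a task with several inputs derives its current watermark.
class ToyTaskWatermark {
    private final long[] perInput;

    ToyTaskWatermark(int numInputs) {
        perInput = new long[numInputs];
        // Until an input has delivered a watermark, it counts as Long.MIN_VALUE.
        Arrays.fill(perInput, Long.MIN_VALUE);
    }

    // Records a watermark from one input and returns the task's resulting watermark.
    long onWatermark(int input, long watermark) {
        perInput[input] = Math.max(perInput[input], watermark);
        return currentWatermark();
    }

    // The task can only be as far along in event time as its slowest input.
    long currentWatermark() {
        return Arrays.stream(perInput).min().getAsLong();
    }

    public static void main(String[] args) {
        ToyTaskWatermark task = new ToyTaskWatermark(2);
        // Only input 0 has reported so far, so the task is still at Long.MIN_VALUE.
        System.out.println(task.onWatermark(0, 5_000L) == Long.MIN_VALUE); // prints true
        // Once every input has reported, the minimum becomes a real timestamp.
        System.out.println(task.onWatermark(1, 3_000L)); // prints 3000
    }
}
```

Under this model, a trigger that reads the current watermark early in a job's life can see Long.MIN_VALUE even though events are already flowing.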

Apache Flink - How to Combine AssignerWithPeriodicWatermark and AssignerWithPunctuatedWatermark?

Use case: using EventTime with timestamps extracted from Kafka records.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
    .keyBy("platform")
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new AggFunc(), new WindowFunc())
    .countWindowAll(size)
    .apply(someFunc)
    .addSink(someSink);
What I want: Flink extracts the timestamp and emits a watermark for each record during an initial interval (e.g. 20 seconds), and after that emits watermarks periodically (e.g. every 10s).
Reason: If I use periodic watermarks, Flink emits the first watermark only after some interval, and the count in my first 5-minute window is wrong: it is much larger than the count in the subsequent windows. As a workaround I set setAutoWatermarkInterval to 100ms, but this is more frequent than necessary.
Currently, I must use either AssignerWithPeriodicWatermarks or AssignerWithPunctuatedWatermarks. How can I implement a strategy that combines the two? Thanks.
Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-Flink) state in the assigner to keep track of either the time of the first event, or to count events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Because this state is not checkpointed, your application will always revert to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of periodic and punctuated assigners, it's simply an adaptive punctuated assigner.
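A minimal sketch of such an adaptive punctuated assigner might look like the following. To stay runnable without Flink on the classpath, it is a plain-Java stand-in: in a real job the class would implement AssignerWithPunctuatedWatermarks and return a Watermark object instead of a Long. The class name, the warmUpMillis and backOffMillis parameters, and the choice to use the event timestamp itself as the watermark (which assumes roughly in-order events) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for an adaptive punctuated assigner; elements are just their own timestamps.
class AdaptivePunctuatedSketch {
    private final long warmUpMillis;  // hypothetical: emit on every event for this long
    private final long backOffMillis; // hypothetical: afterwards, emit at most this often
    // Plain fields, not Flink state: they are lost on restart, as the answer notes.
    private long firstTimestamp = Long.MIN_VALUE;
    private long lastEmitted = Long.MIN_VALUE;

    AdaptivePunctuatedSketch(long warmUpMillis, long backOffMillis) {
        this.warmUpMillis = warmUpMillis;
        this.backOffMillis = backOffMillis;
    }

    long extractTimestamp(long element, long previousElementTimestamp) {
        return element;
    }

    // Returns the next watermark, or null when no watermark should be emitted.
    Long checkAndGetNextWatermark(long lastElement, long extractedTimestamp) {
        if (firstTimestamp == Long.MIN_VALUE) {
            firstTimestamp = extractedTimestamp;
        }
        if (extractedTimestamp <= lastEmitted) {
            return null; // watermarks must not go backwards
        }
        boolean warmingUp = extractedTimestamp - firstTimestamp < warmUpMillis;
        boolean due = lastEmitted == Long.MIN_VALUE
                || extractedTimestamp - lastEmitted >= backOffMillis;
        if (warmingUp || due) {
            lastEmitted = extractedTimestamp;
            return extractedTimestamp;
        }
        return null;
    }

    // Helper for the demo: feeds timestamps through the assigner, collects watermarks.
    static List<Long> replay(long[] timestamps, long warmUpMillis, long backOffMillis) {
        AdaptivePunctuatedSketch assigner =
                new AdaptivePunctuatedSketch(warmUpMillis, backOffMillis);
        List<Long> emitted = new ArrayList<>();
        long previous = Long.MIN_VALUE;
        for (long element : timestamps) {
            long ts = assigner.extractTimestamp(element, previous);
            Long watermark = assigner.checkAndGetNextWatermark(element, ts);
            if (watermark != null) {
                emitted.add(watermark);
            }
            previous = ts;
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] timestamps = {0, 5_000, 19_000, 21_000, 25_000, 29_000, 31_000};
        // 20 s of per-event watermarks, then at most one watermark per 10 s of event time.
        System.out.println(replay(timestamps, 20_000, 10_000));
        // prints [0, 5000, 19000, 29000]
    }
}
```

Note that the back-off here is measured in event time, since a punctuated assigner is only consulted when an event arrives.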

What is the difference between periodic and punctuated watermarks in Apache Flink?

It would be helpful if someone could give a use case example to explain the difference between the two watermark APIs in Apache Flink listed below:
Periodic watermarks - AssignerWithPeriodicWatermarks[T]
Punctuated watermarks - AssignerWithPunctuatedWatermarks[T]
The main difference between the two types of watermark is how and when the assigner's watermark method is called.
Periodic watermarks
With periodic watermarks, Flink calls getCurrentWatermark() at regular intervals, independently of the stream of events. This interval is defined using
ExecutionConfig.setAutoWatermarkInterval(millis)
Use this class when your watermarks depend (even partially) on the processing time, or when you need watermarks to be emitted even when no events have been received for a while.
Punctuated watermarks
With punctuated watermarks, Flink calls checkAndGetNextWatermark() on each new event, i.e. right after calling extractTimestamp(). An actual watermark is emitted only if checkAndGetNextWatermark() returns a non-null value which is greater than the last watermark.
This means that if you don't receive any new element for a while, no watermark can be emitted.
Use this class if certain special elements act as markers that signify event time progress, and when you want to emit watermarks specifically at certain events. For example, you could have flags in your incoming stream marking the end of a sequence.
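The difference in call pattern between the two styles can be sketched with a small simulation. This is not Flink code: the class WatermarkStylesSketch, the event layout, and the rule that the periodic generator reports the largest timestamp seen so far are assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Toy comparison of the two call patterns.
// Each event is {arrivalTimeMillis, eventTimestamp, isMarker (1 or 0)}.
class WatermarkStylesSketch {

    // Periodic: the generator is polled every intervalMillis of wall-clock time,
    // whether or not any event arrived in between.
    static List<Long> periodic(long[][] events, long intervalMillis, long untilMillis) {
        List<Long> emitted = new ArrayList<>();
        long maxTimestamp = Long.MIN_VALUE;
        long last = Long.MIN_VALUE;
        int next = 0;
        for (long now = intervalMillis; now <= untilMillis; now += intervalMillis) {
            while (next < events.length && events[next][0] <= now) {
                maxTimestamp = Math.max(maxTimestamp, events[next][1]);
                next++;
            }
            long candidate = maxTimestamp; // what getCurrentWatermark() would return here
            if (candidate > last) {
                emitted.add(candidate);
                last = candidate;
            }
        }
        return emitted;
    }

    // Punctuated: the generator is consulted once per event, and in this sketch
    // only marker events produce a (non-null) watermark.
    static List<Long> punctuated(long[][] events) {
        List<Long> emitted = new ArrayList<>();
        long last = Long.MIN_VALUE;
        for (long[] event : events) {
            Long candidate = event[2] == 1 ? Long.valueOf(event[1]) : null;
            if (candidate != null && candidate > last) {
                emitted.add(candidate);
                last = candidate;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[][] events = {
            {50, 1_000, 0}, {120, 2_000, 0}, {130, 3_000, 1}, {390, 4_000, 0}
        };
        System.out.println(periodic(events, 100, 400)); // prints [1000, 3000, 4000]
        System.out.println(punctuated(events));         // prints [3000]
    }
}
```

In the periodic run the poll at 300 ms happens with no new events and emits nothing new, while the punctuated run produces a watermark only for the flagged event.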

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink with Kafka as the message queue. I am applying a sliding event-time window of 120 seconds with a slide of 1 second. I am also inserting a watermark for each second of event time.
My concern is what happens if an element arrives late, after the watermark. In my case, Flink simply discards any message that arrives after its respective watermark. Does Flink provide any mechanism to handle such late messages, such as maintaining a separate window? I have gone through the documentation, but it was not clear to me.
Apache Flink has a concept called allowed lateness for windows, to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink lets you specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
Another option is side outputs: in addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream, and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don’t want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
Allowed lateness can result in multiple outputs for the same window. The window fires once when the watermark passes the end of the window, and then again, with an updated aggregate, for each late element that arrives within the allowed lateness.
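The rule described in this answer for what happens to an arriving element can be written down as a small decision function. This is a sketch of the rule only, not Flink's implementation: the class LateRoutingSketch and the Route names are hypothetical, and it assumes a window's last valid timestamp is its end minus one millisecond.

```java
// Toy decision function for an element that belongs to a window ending at windowEnd.
class LateRoutingSketch {
    enum Route { ADDED_TO_WINDOW, LATE_FIRING, SIDE_OUTPUT_OR_DROPPED }

    static Route route(long windowEnd, long allowedLateness, long currentWatermark) {
        long lastTimestampInWindow = windowEnd - 1;
        if (currentWatermark < lastTimestampInWindow) {
            // The window has not fired yet: the element is simply added.
            return Route.ADDED_TO_WINDOW;
        }
        if (currentWatermark < lastTimestampInWindow + allowedLateness) {
            // The window already fired but its state is still kept: add and fire again.
            return Route.LATE_FIRING;
        }
        // The state has been purged: the element goes to a side output, or is dropped.
        return Route.SIDE_OUTPUT_OR_DROPPED;
    }

    public static void main(String[] args) {
        System.out.println(route(10_000, 4_000, 8_000));  // prints ADDED_TO_WINDOW
        System.out.println(route(10_000, 4_000, 11_000)); // prints LATE_FIRING
        System.out.println(route(10_000, 4_000, 15_000)); // prints SIDE_OUTPUT_OR_DROPPED
    }
}
```

With the default allowed lateness of 0, the middle case never occurs, so every element arriving after the window has fired falls into the last branch.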

Resources