Apache Flink - How to Combine AssignerWithPeriodicWatermark and AssignerWithPunctuatedWatermark? - apache-flink

Usecase: using EventTime and extracted timestamp from records from Kafka.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
.keyBy("platform")
.window(TumblingEventTimeWindows 5 mins))
.aggregate(AggFunc(), WindowFunc())
.countWindowAll(size)
.apply(someFunc)
.addSink(someSink);
What I want: Flink extracts timestamp and emits watermark for each record for an initial interval (e.g. 20 seconds), then it can periodically emits watermark (e.g. each 10s).
Reason: If I used PeriodicWatermark, at the beginning Flink will emit watermark only after some interval and the count in my 1st window of 5 mins is wrong - much larger than the count in the subsequent windows. I had a workaround setting setAutoWatermarkInterval to 100ms but this is more than necessary.
Currently, I must use AssignerWithPeriodicWatermark or AssignerWithPunctuatedWatermark. How can i implement this approach of a combining strategy? Thanks.

Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-flink) state in the assigner to keep track of either the time of the first event, or to count events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Your application will always revert back to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of periodic and punctuated assigners, it's simply an adaptive punctuated assigner.

Related

Flink understanding late events vs watermark

Looking at this article it says regarding watermark
We will now set the watermark as current time - 5 seconds, which tells Flink to expect messages
to be a maximum of 5 seconds dealy - This is because each window will be evaluated only when the watermark passes through it
later in that post it explains that when setting allowed lateness :
Flink will not discard message unless it is past the window_end_time + allowed lateness
is actually causing a delayed evaluation of the window since allowed lateness is set ?
so what is the differnce actually in the usage of watermark and allowed lateness ? when to use which ?
The watermark delay sets a lower bound on how long Flink will wait for out-of-order events before triggering a window for the first time.
The allowed lateness determines for how much longer Flink will keep around a window's state. Any late events that arrive while the window's state is still available will trigger the window again, causing it to produce updated results.
Once the allowed lateness has expired, a window's state is purged, and any supremely late events are either dropped or sent to a side output.
If the downstream consumers of your window's output can deal with receiving updates for window results like this (e.g., the window is connected to a live dashboard), then it may make sense to set a relatively short watermark delay, and use allowed lateness liberally. On the other hand, if you can't take advantage of anything after the initial results, you'll want to make the watermark delay large enough to satisfy your requirements for accuracy/completeness, and set allowed lateness to zero.

Some questions related to Fraud detection demo from Flink DataStream API

The example is very useful at first,it illustrates how keyedProcessFunction is working in Flink
there is something worth noticing, it suddenly came to me...
It is from Fraud Detector v2: State + Time part
It is reasonable to set a timer here, regarding the real application requirement part
override def onTimer(
timestamp: Long,
ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
out: Collector[Alert]): Unit = {
// remove flag after 1 minute
timerState.clear()
flagState.clear()
}
Here is the problem:
The TimeCharacteristic IS ProcessingTime which is determined by the system clock of the running machine, according to ProcessingTime property, the watermark will NOT be changed overtime, so that means onTimer will never be called, unless the TimeCharacteristic changes to eventTime
According the flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function be triggered? because the condition for a window to be triggered is when the watermark enters the end time of a window
I'm wondering the condition where the window is triggered or not doesn't depend on watermark in priocessingTime, even though the official website doesn't mention that at all, it will be based on the processing time to trigger the window
Hope someone can spend a little time on this,many thx!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
Watermarks are only relevant when doing event time processing, and only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in this DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly, or indirectly (by using event time windows, or through one of the higher level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to a match a pattern to a sequence of events within a specific timeframe -- so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (i.e., 10:00am to 10:01am).

TumblingProcessingTimeWindows processing with event time characteristic is not triggered

My use-case is quite simple I receive events that contain "event timestamp", and want them to be aggregated based on event time. and the output is a periodical processing time tumbling window every 10min.
More specific, the stream of data that is keyed and need to compute counts for 7 seconds.
a tumbling window of 1 second
a sliding window for counting 7 seconds with an advance of 1 second
a windowall to output all counts every 1s
I am not able to integration test it (i.e., similar to unit test but an end-to-end testing) as the input has fake event time, which won't trigger
Here is my snippet
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val oneDayCounts = data
.map(t => (t.key1, t.key2, 1L, t.timestampMs))
.keyBy(0, 1)
.timeWindow(Time.seconds(1))
.sum(2)
val sevenDayCounts = oneDayCounts
.keyBy(0,1)
.timeWindow(Time.seconds(3), Time.seconds(1))
.sum(2)
// single reducer
sevenDayCounts
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
.process(...)
I use EventTime as timestamp and set up an integration test code with MiniClusterWithClientResource. also created some fake data with some event timestamp like 1234l, 4567l, etc.
EventTimeTrigger is able to be fired for sum computation but the following TumblingProcessingTimeWindow is not able to trigger. I had a Thread.sleep of 30s in the IT test code but still not triggered after the 30s
In general it's a challenge to write meaningful tests for processing time windows, since they are inherently non-deterministic. This is one reason why event time windows are generally prefered.
It's also going to be difficult to put a sleep in the right place so that is has the desired effect. But one way to keep the job running long enough for the processing time window to fire would be to use a custom source that includes a sleep. Flink streaming jobs with finite sources shut themselves down once the input has been exhausted. One final watermark with the value MAX_WATERMARK gets sent through the pipeline, which triggers all event time windows, but processing time windows are only fired if they are still running when the appointed time arrives.
See this answer for an example of a hack that works around this.
Alternatively, you might take a look at https://github.com/apache/flink/blob/master/flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/operators/windowing/TumblingProcessingTimeWindowsTest.java to see how processing time windows can be tested by mocking getCurrentProcessingTime.

Why does Apache Flink need Watermarks for Event Time Processing?

Can someone explain Event timestamp and watermark properly. I understood it from docs, but it is not so clear. A real life example or layman definition will help. Also, if it is possible give an example ( Along with some code snippet which can explain it ).Thanks in advance
Here's an example that illustrates why we need watermarks, and how they work.
In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time timestamps that indicate when these events actually occurred. The first event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2, and so on:
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
Now imagine that we are trying create a stream sorter. This is meant to be an application that processes each event from a stream as it arrives, and emits a new stream containing the same events, but ordered by their timestamps.
Some observations:
(1) The first element our stream sorter sees is the 4, but we can't just immediately release it as the first element of the sorted stream. It may have arrived out of order, and an earlier event might yet arrive. In fact, we have the benefit of some god-like knowledge of this stream's future, and we can see that our stream sorter should wait at least until the 2 arrives before producing any results.
Conclusion: Some buffering, and some delay, is necessary.
(2) If we do this wrong, we could end up waiting forever. First our application saw an event from time 4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe. Maybe not. We could wait forever and never see a 1.
Conclusion: Eventually we have to be courageous and emit the 2 as the start of the sorted stream.
(3) What we need then is some sort of policy that defines when, for any given timestamped event, to stop waiting for the arrival of earlier events.
This is precisely what watermarks do — they define when to stop waiting for earlier events.
Event-time processing in Flink depends on watermark generators that insert special timestamped elements into the stream, called watermarks.
When should our stream sorter stop waiting, and push out the 2 to start the sorted stream? When a watermark arrives with a timestamp of 2, or greater.
(4) We can imagine different policies for deciding how to generate watermarks.
We know that each event arrives after some delay, and that these delays vary, so some events are delayed more than others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink refers to this strategy as bounded-out-of-orderness watermarking. It's easy to imagine more complex approaches to watermarking, but for many applications a fixed delay works well enough.
If you want to build an application like a stream sorter, Flink's KeyedProcessFunction is the right building block. It provides access to event-time timers (that is, callbacks that fire based on the arrival of watermarks), and has hooks for managing the state needed for buffering events until it's their turn to be sent downstream.

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink where the Kafka is the message queue. I am applying EventTimeSlidingWindow of 120 sec. and slide of 1 sec. I am also inserting the watermark at each second of Event Time.
My concern is what happened if the element will come late, after the watermark? Now I my case, Flink simply discard the message which come after its respective watermark. Is there any mechanism provided by the filnk to handle such late message, like maintaining separate window? I have also gone through the documentation but I did not get clear about it.
Apache Flink has a concept called allowed lateness for the windows to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped, and its default value is 0. Elements that arrive after the watermark has passed the end of the window but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
Also another option is SideOoutput i.e. In addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don’t want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
Allowed lateness can result in multiple outputs. So end of window and end of watermark from the last even is one time and then for each element that’s late another aggregated output.

Resources