Flink: Watermarking with Late Elements - apache-flink

I am doing real-time streaming in Flink with Kafka as the message queue. I am applying an event-time sliding window of 120 seconds with a slide of 1 second, and I am also inserting a watermark at each second of event time.
My concern is: what happens if an element arrives late, after the watermark? In my case, Flink simply discards messages that arrive after their respective watermark. Is there any mechanism provided by Flink to handle such late messages, like maintaining a separate window? I have gone through the documentation, but it did not make this clear to me.

Apache Flink has a concept called allowed lateness for windows to handle data that arrives after the watermark.

By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows you to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped; its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
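For example, here is a minimal sketch of the question's 120-second sliding window with 10 seconds of allowed lateness (the lateness value, Event, getKey(), and merge() are illustrative assumptions, not from the question):

import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// events is a DataStream<Event> with timestamps and watermarks already assigned
events
    .keyBy(e -> e.getKey())
    .window(SlidingEventTimeWindows.of(Time.seconds(120), Time.seconds(1)))
    // keep window state around for 10 extra seconds; elements arriving within
    // this bound re-fire the window (with the default EventTimeTrigger)
    .allowedLateness(Time.seconds(10))
    .reduce((a, b) -> a.merge(b));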
Another option is side outputs: in addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the result streams does not have to match the type of data in the main stream, and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each stream the data that you don't want to have.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
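A minimal sketch of capturing too-late window elements via a side output (Event, Result, and MyAggregate are illustrative names):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// declared as an anonymous subclass so the tag retains its type information
final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Result> results = events
    .keyBy(e -> e.getKey())
    .window(SlidingEventTimeWindows.of(Time.seconds(120), Time.seconds(1)))
    .sideOutputLateData(lateTag)   // elements too late for the window go here
    .aggregate(new MyAggregate());

// a separate stream of the late elements, for reprocessing or logging
DataStream<Event> lateEvents = results.getSideOutput(lateTag);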

Allowed lateness can result in multiple outputs: the window fires once when the watermark passes the end of the window, and then fires again with an updated aggregate for each late element that arrives within the allowed lateness.

Related

How does order of streams in flink connect matter?

How would the two scenarios compare or differ
I) stream1.connect(stream2).flatMap(new ConnectFunction())
flatMap1 of ConnectFunction will process input from stream1 and flatMap2 will process input from stream2.
II) stream2.connect(stream1).flatMap(new ConnectFunction())
flatMap1 of ConnectFunction will process input from stream2 and flatMap2 will process input from stream1.
That’s all there is to it. There’s no significant difference.
As David mentioned, there's no significant difference. But in case you're asking because you're worried about the order in which the inputs arrive, you should always keep in mind that there is no guarantee whatsoever that one stream (for example your "control stream", in case you're doing some kind of "dynamic filter") will actually be processed before the stream you want to filter. One stream could fall behind due to Kafka consumer lag or many other reasons.
As such it is a common pattern to set up a state for both streams so that you can keep items in a "pending processing" state until you receive what you need from the other stream.
For instance, if you have a control stream called control that contains items from a blocklist/disallowlist that you would like to use to filter an input stream called input, then after keying by a common key, it won't matter whether you do input.connect(control) or control.connect(input). The important point to remember is that there is no guarantee the control stream will have produced anything by the time you receive your first element from input. For that reason you would set up a state (it could be a ListState[InputType]) to keep track of all input items that arrive before you receive what you need from control. In other words, you'd be keeping track of "pending items". Once you do receive the required item from control, you process all "pending" items from the ListState.
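Here is a rough sketch of that "pending items" pattern for a keyed blocklist filter (InputType, Rule, and blocks() are hypothetical; flatMap1 handles input, flatMap2 handles control, i.e. input.connect(control)):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class PendingBuffer extends RichCoFlatMapFunction<InputType, Rule, InputType> {

    private transient ValueState<Rule> rule;         // rule seen so far for this key
    private transient ListState<InputType> pending;  // items waiting for a rule

    @Override
    public void open(Configuration parameters) {
        rule = getRuntimeContext().getState(
            new ValueStateDescriptor<>("rule", Rule.class));
        pending = getRuntimeContext().getListState(
            new ListStateDescriptor<>("pending", InputType.class));
    }

    @Override
    public void flatMap1(InputType item, Collector<InputType> out) throws Exception {
        Rule r = rule.value();
        if (r == null) {
            pending.add(item);        // no rule yet: buffer until flatMap2 fires
        } else if (!r.blocks(item)) {
            out.collect(item);
        }
    }

    @Override
    public void flatMap2(Rule r, Collector<InputType> out) throws Exception {
        rule.update(r);
        Iterable<InputType> buffered = pending.get();
        if (buffered != null) {
            for (InputType item : buffered) {   // drain the buffered items
                if (!r.blocks(item)) {
                    out.collect(item);
                }
            }
        }
        pending.clear();
    }
}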
This is documented in the Learn Flink > ETL section, in the last paragraph of the Connected streams section:
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.)

Reading two streams (main and configs) sequentially in Flink

I have two streams. One is the main stream; let's say, in a fraud-detection example, it is the stream of transactions. The second stream is configs; in our example, it is the rules. I connect the main stream to the config stream in order to do the processing. But when Flink starts for the first time and we add the job, it starts consuming from the transactions and configs streams in parallel, and when it wants to process a transaction it sometimes sees that there is no config yet, and we have to send the transaction to a dead letter queue. However, what I want to achieve is: if there is a potential config which I could get a bit later, I want to get that config first and only then process the transaction, rather than sending it to the dead letter queue. I have the same key for transactions and configs.
Long story short, is there a way to tell Flink, when the job first starts, to consume one stream until there are no new values, and only then start processing the main stream? How can I make them somewhat sequential?
The recommended way to approach this is to connect the 2 streams and apply a RichCoFlatMap that will allow you to buffer events from main while you're waiting to receive the config events.
Check out this useful section of the Flink tutorials. The very last paragraph actually describes your problem.
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.)
So in a nutshell you should connect your two streams and have some kind of ListState where you can "buffer" your main elements while waiting to receive the rules. When you receive an element from the config stream, you check whether you have some pending elements "waiting" for that config in your ListState (your buffer). If you do, you can then process these elements and emit them through the collector of your flatMap (the RichCoFlatMapFunction sketch in the previous answer illustrates this same pattern).
Starting with version 1.16, you can use the hybrid source support in Flink to read all of one source (configs, in your case) before reading the second source. Though I imagine you'd have to map the events to an Either<config, transaction> so that the data stream has consistent record types.
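A rough sketch under those assumptions, where configSource and transactionSource are hypothetical Source implementations that both produce Either<Config, Transaction>:

import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.types.Either;

// the first source must be bounded so the HybridSource knows when to
// switch over to the second one
HybridSource<Either<Config, Transaction>> source =
    HybridSource.builder(configSource)    // consumed to completion first
        .addSource(transactionSource)     // then consumption switches here
        .build();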

Flink understanding late events vs watermark

Looking at this article, it says the following regarding the watermark:
We will now set the watermark as current time - 5 seconds, which tells Flink to expect messages to be a maximum of 5 seconds delayed - this is because each window will be evaluated only when the watermark passes through it.
Later the post explains that when setting allowed lateness:
Flink will not discard a message unless it is past the window_end_time + allowed lateness.
Isn't that actually causing a delayed evaluation of the window, since allowed lateness is set?
So what is actually the difference between the watermark and allowed lateness? When should I use which?
The watermark delay sets a lower bound on how long Flink will wait for out-of-order events before triggering a window for the first time.
The allowed lateness determines for how much longer Flink will keep around a window's state. Any late events that arrive while the window's state is still available will trigger the window again, causing it to produce updated results.
Once the allowed lateness has expired, a window's state is purged, and any supremely late events are either dropped or sent to a side output.
If the downstream consumers of your window's output can deal with receiving updates for window results like this (e.g., the window is connected to a live dashboard), then it may make sense to set a relatively short watermark delay, and use allowed lateness liberally. On the other hand, if you can't take advantage of anything after the initial results, you'll want to make the watermark delay large enough to satisfy your requirements for accuracy/completeness, and set allowed lateness to zero.
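For instance, here is a minimal sketch combining a 5-second watermark delay (as in the article) with one minute of allowed lateness, using the classic pre-1.11 assigner API; Event, getEventTime(), getKey(), merge(), and the window size are assumptions:

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

stream
    // watermark = max event timestamp seen so far - 5 seconds
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Event e) {
                return e.getEventTime();
            }
        })
    .keyBy(e -> e.getKey())
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    // late events within this bound re-fire the window with updated results
    .allowedLateness(Time.minutes(1))
    .reduce((a, b) -> a.merge(b));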

Apache Flink - How to Combine AssignerWithPeriodicWatermark and AssignerWithPunctuatedWatermark?

Use case: using event time and extracting timestamps from Kafka records.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
.keyBy("platform")
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new AggFunc(), new WindowFunc())
.countWindowAll(size)
.apply(someFunc)
.addSink(someSink);
What I want: Flink extracts the timestamp and emits a watermark for each record for an initial interval (e.g. 20 seconds); after that, it can emit watermarks periodically (e.g. every 10 s).
Reason: if I use a periodic watermark, at the beginning Flink will emit a watermark only after some interval, and the count in my first 5-minute window is wrong - much larger than the count in the subsequent windows. I had a workaround setting setAutoWatermarkInterval to 100 ms, but this is more than necessary.
Currently, I must use either AssignerWithPeriodicWatermarks or AssignerWithPunctuatedWatermarks. How can I implement this combined strategy? Thanks.
Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-Flink) state in the assigner to keep track of either the time of the first event or a count of events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Your application will revert to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of periodic and punctuated assigners, it's simply an adaptive punctuated assigner.
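A sketch of such an adaptive assigner, emitting a watermark per event for the first 20 seconds after startup and then at most one every 10 seconds (thresholds from the question; Event and getEventTime() are assumed):

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class AdaptivePunctuatedAssigner
        implements AssignerWithPunctuatedWatermarks<Event> {

    // plain (non-Flink) state: resets whenever the job restarts
    private long startedAt = 0;       // wall-clock time of the first event
    private long lastEmitted = 0;     // wall-clock time of the last watermark
    private long maxTimestamp = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(Event e, long previousElementTimestamp) {
        if (startedAt == 0) {
            startedAt = System.currentTimeMillis();
        }
        maxTimestamp = Math.max(maxTimestamp, e.getEventTime());
        return e.getEventTime();
    }

    @Override
    public Watermark checkAndGetNextWatermark(Event lastElement, long extractedTimestamp) {
        long now = System.currentTimeMillis();
        boolean warmingUp = now - startedAt < 20_000;       // first 20 s: every event
        boolean intervalElapsed = now - lastEmitted >= 10_000; // afterwards: every 10 s
        if (warmingUp || intervalElapsed) {
            lastEmitted = now;
            // a real assigner would subtract an out-of-orderness bound here
            return new Watermark(maxTimestamp);
        }
        return null; // no watermark for this event
    }
}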

Is there a way to define a Flink count window, which evicts all messages after a given time, if the count is not reached?

I am currently working on a streaming program which aggregates the data from a number of messages (8); the aggregation requires all 8 messages, so I am using a count window. All 8 messages share the same unique key. However, there is no guarantee that all 8 messages will arrive. So my question is two-fold:
First, what happens to a Flink count window that never closes? I am assuming the windows simply accumulate over time, consuming more and more RAM.
Secondly, can I close a count window if it does not receive all of its messages within a given time? I am looking for a solution that is as real-time as possible; I already tried using a time window, however the time-of-flight of the messages varies between a few milliseconds and 40 seconds.
So essentially: is there a way to define a window that triggers at 8 messages, and evicts all messages from the window after a given time (in this case, after 60 seconds)?
The answer to your question regarding never-closing windows is that the part of the state reserved for them will never be freed.
The behaviour you describe could be implemented with a custom trigger and evictor on a GlobalWindow. The trigger could wait for either the expected time or the expected number of elements before emitting the window, while the evictor would evict all messages if there are fewer than 8. For reference implementations you can have a look at CountTrigger (fires on count) and EventTimeTrigger (fires on time); for the evictor, have a look at CountEvictor. A rough sketch of such a trigger follows.
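This sketch fires when 8 elements have arrived, or purges the group 60 seconds after its first element (the count and timeout come from the question; a production version would also remember and delete the timeout timer when firing early, so a stale timer cannot purge the next group):

import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

public class CountOrTimeoutTrigger<T> extends Trigger<T, GlobalWindow> {

    private final ReducingStateDescriptor<Long> countDesc =
        new ReducingStateDescriptor<>("count", Long::sum, LongSerializer.INSTANCE);

    @Override
    public TriggerResult onElement(T element, long timestamp, GlobalWindow window,
                                   TriggerContext ctx) throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        if (count.get() == null) {
            // first element of a group: arm the 60 s timeout
            ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + 60_000);
        }
        count.add(1L);
        if (count.get() >= 8) {
            count.clear();
            return TriggerResult.FIRE_AND_PURGE; // all 8 messages arrived
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window,
                                          TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
        return TriggerResult.PURGE; // timeout: evict the incomplete group
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }
}

// usage (AssembleFunction is hypothetical):
// input.keyBy(m -> m.getKey())
//      .window(GlobalWindows.create())
//      .trigger(new CountOrTimeoutTrigger<>())
//      .apply(new AssembleFunction());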
For cases like this where you need to combine stateful stream processing with timers, ProcessFunction can be a good choice. See https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html.
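For illustration, here is a minimal sketch using a keyed process function instead of a window: it buffers up to 8 messages per key, emits early when all 8 arrive, and discards the incomplete group after 60 seconds (Message, Result, and Result.aggregate() are hypothetical):

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.List;

public class CountOrTimeout extends KeyedProcessFunction<String, Message, Result> {

    private transient ListState<Message> buffer;  // messages received so far
    private transient ValueState<Long> timerTs;   // timestamp of the timeout timer

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
            new ListStateDescriptor<>("buffer", Message.class));
        timerTs = getRuntimeContext().getState(
            new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(Message msg, Context ctx, Collector<Result> out)
            throws Exception {
        if (timerTs.value() == null) {
            // first message of a group: arm the 60 s timeout
            long ts = ctx.timerService().currentProcessingTime() + 60_000;
            ctx.timerService().registerProcessingTimeTimer(ts);
            timerTs.update(ts);
        }
        buffer.add(msg);
        List<Message> msgs = new ArrayList<>();
        buffer.get().forEach(msgs::add);
        if (msgs.size() == 8) {
            // all messages arrived: emit the aggregate and clean up
            ctx.timerService().deleteProcessingTimeTimer(timerTs.value());
            out.collect(Result.aggregate(msgs));
            buffer.clear();
            timerTs.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) {
        // timeout: evict the incomplete group so state doesn't grow forever
        buffer.clear();
        timerTs.clear();
    }
}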
