Is there a way to define a Flink count window that evicts all messages after a given time, if the count is not reached?

I am currently working on a streaming program which aggregates data from a number of messages (8). The aggregation requires all 8 messages, so I am using a count window. All 8 messages share the same unique key. However, there is no guarantee that all 8 messages will arrive. So my question is two-fold:
First, what happens to a Flink count window that never closes? I am assuming such windows simply accumulate over time, consuming more and more RAM.
Secondly, can I close a count window if it does not receive all of its messages within a given time? I am looking for a solution that is as real-time as possible. I already tried using a time window, but the time-of-flight of the messages varies between a few milliseconds and 40 seconds.
So, essentially: is there a way to define a window that triggers at 8 messages and evicts all messages from the window after a given time (in this case, after 60 seconds)?

Regarding windows that never close: the state reserved for them will never be freed.
The behaviour you describe can be implemented with a custom Trigger and Evictor on a GlobalWindow. The trigger can fire either when the expected time has passed or when the expected number of elements has arrived, while the evictor can evict all messages if there are fewer than 8. For reference implementations, have a look at CountTrigger (fires on count) and EventTimeTrigger (fires on time); for the evictor, have a look at CountEvictor.
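Roughly, such a trigger could look like the sketch below. The CountWithTimeoutTrigger name, the processing-time timeout, and the FIRE_AND_PURGE/PURGE choices are my own assumptions; only the Trigger API itself is Flink's:

    import org.apache.flink.api.common.functions.ReduceFunction
    import org.apache.flink.api.common.state.{ReducingStateDescriptor, ValueStateDescriptor}
    import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow

    class CountWithTimeoutTrigger[T](maxCount: Long, timeoutMs: Long)
        extends Trigger[T, GlobalWindow] {

      // Keyed per-window state: element count and the current batch's deadline.
      private val countDesc = new ReducingStateDescriptor[java.lang.Long](
        "count",
        new ReduceFunction[java.lang.Long] {
          override def reduce(a: java.lang.Long, b: java.lang.Long): java.lang.Long = a + b
        },
        classOf[java.lang.Long])
      private val deadlineDesc =
        new ValueStateDescriptor[java.lang.Long]("deadline", classOf[java.lang.Long])

      override def onElement(element: T, timestamp: Long, window: GlobalWindow,
                             ctx: Trigger.TriggerContext): TriggerResult = {
        val count = ctx.getPartitionedState(countDesc)
        val deadline = ctx.getPartitionedState(deadlineDesc)
        if (deadline.value() == null) {
          // First element of a new batch: start the timeout clock.
          val t = ctx.getCurrentProcessingTime() + timeoutMs
          ctx.registerProcessingTimeTimer(t)
          deadline.update(t)
        }
        count.add(1L)
        if (count.get() >= maxCount) {
          ctx.deleteProcessingTimeTimer(deadline.value())
          count.clear()
          deadline.clear()
          TriggerResult.FIRE_AND_PURGE // complete batch: emit it and drop the elements
        } else {
          TriggerResult.CONTINUE
        }
      }

      override def onProcessingTime(time: Long, window: GlobalWindow,
                                    ctx: Trigger.TriggerContext): TriggerResult = {
        // Timeout expired before maxCount elements arrived: discard the partial batch.
        ctx.getPartitionedState(countDesc).clear()
        ctx.getPartitionedState(deadlineDesc).clear()
        TriggerResult.PURGE
      }

      override def onEventTime(time: Long, window: GlobalWindow,
                               ctx: Trigger.TriggerContext): TriggerResult =
        TriggerResult.CONTINUE

      override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
        ctx.getPartitionedState(countDesc).clear()
        ctx.getPartitionedState(deadlineDesc).clear()
      }
    }

You would then wire it up along the lines of stream.keyBy(...).window(GlobalWindows.create()).trigger(new CountWithTimeoutTrigger(8, 60000)).process(...). Since this sketch purges on both paths, a separate evictor may not even be strictly necessary for this particular case.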

For cases like this, where you need to combine stateful stream processing with timers, ProcessFunction can be a good choice. See https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html.
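To make that concrete, here is a rough sketch of the same 8-messages-or-60-seconds logic as a KeyedProcessFunction. The Message type, the BatchOrTimeoutFunction name, and the choice to silently drop incomplete batches are illustrative assumptions, not from the question:

    import org.apache.flink.api.common.state.{ListStateDescriptor, ValueStateDescriptor}
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction
    import org.apache.flink.util.Collector
    import scala.collection.JavaConverters._

    case class Message(key: String, payload: String)

    class BatchOrTimeoutFunction(batchSize: Int, timeoutMs: Long)
        extends KeyedProcessFunction[String, Message, Seq[Message]] {

      // Buffered messages for the current key, plus the pending timer (if any).
      private lazy val buffer = getRuntimeContext.getListState(
        new ListStateDescriptor[Message]("buffer", classOf[Message]))
      private lazy val timer = getRuntimeContext.getState(
        new ValueStateDescriptor[java.lang.Long]("timer", classOf[java.lang.Long]))

      override def processElement(msg: Message,
          ctx: KeyedProcessFunction[String, Message, Seq[Message]]#Context,
          out: Collector[Seq[Message]]): Unit = {
        buffer.add(msg)
        if (timer.value() == null) {
          // First message of a batch: give the rest 60 seconds to show up.
          val deadline = ctx.timerService().currentProcessingTime() + timeoutMs
          ctx.timerService().registerProcessingTimeTimer(deadline)
          timer.update(deadline)
        }
        val elements = buffer.get().asScala.toSeq
        if (elements.size >= batchSize) {
          out.collect(elements) // complete batch: emit it downstream
          cleanUp(ctx)
        }
      }

      override def onTimer(timestamp: Long,
          ctx: KeyedProcessFunction[String, Message, Seq[Message]]#OnTimerContext,
          out: Collector[Seq[Message]]): Unit = {
        // Timeout: drop the incomplete batch and free the key's state.
        cleanUp(ctx)
      }

      private def cleanUp(
          ctx: KeyedProcessFunction[String, Message, Seq[Message]]#Context): Unit = {
        Option(timer.value()).foreach(t => ctx.timerService().deleteProcessingTimeTimer(t))
        buffer.clear()
        timer.clear()
      }
    }

Wired up as stream.keyBy(_.key).process(new BatchOrTimeoutFunction(8, 60000)), this emits a batch as soon as the 8th message arrives and otherwise frees the key's state after 60 seconds.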

Related

Why are my Flink windows using so much state?

The checkpoints for my Flink job are getting larger and larger. After drilling down into individual tasks, the keyed window function seems to be responsible for most of the size. How can I reduce this?
If you have a lot of state tied up in windows, there are several possibilities:
Using incremental aggregation (by using reduce or aggregate) can dramatically reduce your storage requirements. Otherwise each event is copied into the list of events assigned to each window. (See the aggregation sketch after this list.)
If you are aggregating over multiple timeframes, e.g., every minute and every 10 minutes, you can cascade these windows, so that the 10-minute windows consume only the output of the one-minute windows, rather than every event.
If you are using sliding windows, each event is being assigned to each of the overlapping windows. For example, if your windows are 2 minutes long and sliding by 1 second, each event is being copied into 120 windows. Incremental and/or pre-aggregation will help here (a lot!), or you may want to use a KeyedProcessFunction instead of a window in order to optimize your state footprint.
If you have keyed count windows, you could have keys for which the requisite batch size is never (or only very slowly) reached, leading to more and more partial batches sitting around in state. You could implement a custom Trigger that incorporates a timeout in addition to the count-based triggering, so that these partial batches are eventually processed (see the trigger sketch in the first answer above).
If you are using globalState in a ProcessWindowFunction, the globalState for stale keys will accumulate. You can use state TTL on the state descriptor for the globalState (see the TTL sketch after this list). Note: this is the only place where window state isn't automatically freed when windows are cleared.
Or it may simply be that your key space is growing over time, and there's really nothing that can be done except to scale up the cluster.
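For the incremental-aggregation point, a minimal sketch, assuming a simple Event type and a per-key average (both illustrative, not from the question):

    import org.apache.flink.api.common.functions.AggregateFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Event(key: String, value: Double)

    // The accumulator is just (sum, count): two numbers per window,
    // instead of a list containing every event.
    class AvgAggregate extends AggregateFunction[Event, (Double, Long), Double] {
      override def createAccumulator(): (Double, Long) = (0.0, 0L)
      override def add(e: Event, acc: (Double, Long)): (Double, Long) =
        (acc._1 + e.value, acc._2 + 1)
      override def getResult(acc: (Double, Long)): Double = acc._1 / acc._2
      override def merge(a: (Double, Long), b: (Double, Long)): (Double, Long) =
        (a._1 + b._1, a._2 + b._2)
    }

    def minuteAverages(events: DataStream[Event]): DataStream[Double] =
      events.keyBy(_.key).timeWindow(Time.minutes(1)).aggregate(new AvgAggregate)

For the state TTL point, a sketch of enabling TTL on a descriptor used for globalState; the 12-hour TTL and the descriptor name are arbitrary choices:

    import org.apache.flink.api.common.state.{StateTtlConfig, ValueStateDescriptor}
    import org.apache.flink.api.common.time.Time

    val ttlConfig = StateTtlConfig
      .newBuilder(Time.hours(12))
      .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
      .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
      .build()

    val globalDesc = new ValueStateDescriptor[Long]("seen-count", classOf[Long])
    globalDesc.enableTimeToLive(ttlConfig) // stale keys are eventually cleaned up
    // Inside a ProcessWindowFunction: context.globalState.getState(globalDesc)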

Consolidate/discard events in count window

I just started using Flink and have a problem I'm not sure how to solve. I get events from a Kafka Topic, these events represent a "beacon" signal from a mobile device. The device sends an event every 10 seconds.
I have an external customer that is asking for a beacon from our devices, but only every 60 seconds. Since we are already using Flink to process other events, I thought I could solve this using a count window, but I'm struggling to understand how to "discard" the first 5 events and emit only the last one. Any ideas?
There are a few ways to do this. As far as I understand, the idea is as follows: you receive a beacon signal every 10 seconds, but you only need the most recent one and can discard the others, since the client asks for the data every 60 seconds.
The simplest would of course be to use a window function with a count or event-time window, as you said. The type of the window depends on your requirements. Then you could do something like this:
stream.timeWindow([windowSize]).process(new CustomWindowProcessFunction())
The signature of the process() method of ProcessWindowFunction is as follows (depending on the exact variant of the function): def process(context: Context, elements: Iterable[IN], out: Collector[OUT]). It gives you access to all of the window's elements, so you can easily push downstream only the elements you want.
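For example, here is a sketch of a ProcessWindowFunction that keeps only the newest beacon per device in each 60-second window (the Beacon type and its field names are assumptions):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
    import org.apache.flink.streaming.api.windowing.time.Time
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow
    import org.apache.flink.util.Collector

    case class Beacon(deviceId: String, timestamp: Long)

    class LastBeaconFunction
        extends ProcessWindowFunction[Beacon, Beacon, String, TimeWindow] {
      override def process(key: String, context: Context,
                           elements: Iterable[Beacon],
                           out: Collector[Beacon]): Unit = {
        // Emit only the beacon with the largest timestamp; discard the rest.
        out.collect(elements.maxBy(_.timestamp))
      }
    }

    def lastPerMinute(beacons: DataStream[Beacon]): DataStream[Beacon] =
      beacons.keyBy(_.deviceId).timeWindow(Time.seconds(60)).process(new LastBeaconFunction)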
While this is the simplest approach, you may also want to take a look at Flink timers, as they seem to be a good fit for your issue. They are described here.

Why does Apache Flink need Watermarks for Event Time Processing?

Can someone explain event timestamps and watermarks properly? I read the docs, but it is still not clear. A real-life example or layman's definition would help. Also, if possible, give an example, along with a code snippet that explains it. Thanks in advance.
Here's an example that illustrates why we need watermarks, and how they work.
In this example we have a stream of timestamped events that arrive somewhat out of order, as shown below. The numbers shown are event-time timestamps that indicate when these events actually occurred. The first event to arrive happened at time 4, and it is followed by an event that happened earlier, at time 2, and so on:
··· 23 19 22 24 21 14 17 13 12 15 9 11 7 2 4 →
Now imagine that we are trying to create a stream sorter. This is meant to be an application that processes each event from a stream as it arrives, and emits a new stream containing the same events, but ordered by their timestamps.
Some observations:
(1) The first element our stream sorter sees is the 4, but we can't just immediately release it as the first element of the sorted stream. It may have arrived out of order, and an earlier event might yet arrive. In fact, we have the benefit of some god-like knowledge of this stream's future, and we can see that our stream sorter should wait at least until the 2 arrives before producing any results.
Conclusion: Some buffering, and some delay, is necessary.
(2) If we do this wrong, we could end up waiting forever. First our application saw an event from time 4, and then an event from time 2. Will an event with a timestamp less than 2 ever arrive? Maybe. Maybe not. We could wait forever and never see a 1.
Conclusion: Eventually we have to be courageous and emit the 2 as the start of the sorted stream.
(3) What we need then is some sort of policy that defines when, for any given timestamped event, to stop waiting for the arrival of earlier events.
This is precisely what watermarks do — they define when to stop waiting for earlier events.
Event-time processing in Flink depends on watermark generators that insert special timestamped elements into the stream, called watermarks.
When should our stream sorter stop waiting, and push out the 2 to start the sorted stream? When a watermark arrives with a timestamp of 2, or greater.
(4) We can imagine different policies for deciding how to generate watermarks.
We know that each event arrives after some delay, and that these delays vary, so some events are delayed more than others. One simple approach is to assume that these delays are bounded by some maximum delay. Flink refers to this strategy as bounded-out-of-orderness watermarking. It's easy to imagine more complex approaches to watermarking, but for many applications a fixed delay works well enough.
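In code, bounded-out-of-orderness watermarking looks roughly like the sketch below. The TimedEvent type and the 10-second bound are assumptions, and the WatermarkStrategy API shown here requires Flink 1.11 or later:

    import java.time.Duration
    import org.apache.flink.api.common.eventtime.{SerializableTimestampAssigner, WatermarkStrategy}

    case class TimedEvent(id: String, timestamp: Long)

    // Watermarks trail the largest timestamp seen so far by 10 seconds,
    // i.e., we assume no event is more than 10 seconds late.
    val strategy = WatermarkStrategy
      .forBoundedOutOfOrderness[TimedEvent](Duration.ofSeconds(10))
      .withTimestampAssigner(new SerializableTimestampAssigner[TimedEvent] {
        override def extractTimestamp(e: TimedEvent, recordTs: Long): Long = e.timestamp
      })

    // Usage: events.assignTimestampsAndWatermarks(strategy)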
If you want to build an application like a stream sorter, Flink's KeyedProcessFunction is the right building block. It provides access to event-time timers (that is, callbacks that fire based on the arrival of watermarks), and has hooks for managing the state needed for buffering events until it's their turn to be sent downstream.
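Here is a rough sketch of such a sorter, following the pattern from the Flink training exercises. The SortEvent type is an assumption, and for simplicity it buffers one event per timestamp:

    import org.apache.flink.api.common.state.MapStateDescriptor
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction
    import org.apache.flink.util.Collector

    case class SortEvent(key: String, timestamp: Long)

    class SortFunction extends KeyedProcessFunction[String, SortEvent, SortEvent] {

      // Buffer events by timestamp until the watermark passes them.
      private lazy val buffer = getRuntimeContext.getMapState(
        new MapStateDescriptor[java.lang.Long, SortEvent](
          "buffer", classOf[java.lang.Long], classOf[SortEvent]))

      override def processElement(e: SortEvent,
          ctx: KeyedProcessFunction[String, SortEvent, SortEvent]#Context,
          out: Collector[SortEvent]): Unit = {
        if (e.timestamp > ctx.timerService().currentWatermark()) {
          buffer.put(e.timestamp, e)
          // This timer fires once the watermark reaches the event's timestamp,
          // i.e., once no earlier event should still be on its way.
          ctx.timerService().registerEventTimeTimer(e.timestamp)
        } // else: the event is late; drop it, or route it to a side output
      }

      override def onTimer(timestamp: Long,
          ctx: KeyedProcessFunction[String, SortEvent, SortEvent]#OnTimerContext,
          out: Collector[SortEvent]): Unit = {
        out.collect(buffer.get(timestamp))
        buffer.remove(timestamp)
      }
    }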

Extended windows

I have an always-on application, listening to a Kafka stream and processing events. Events are part of a session, and I need to do calculations based on a session's data. I am running into a problem trying to run my calculations correctly, due to the length of my sessions: 90% of my sessions are done after 5 minutes, and 99% are done after 1 hour, but sessions may last more than a day. Because this is a real-time system, there is no determined end. Sessions are unique and should never collide.
I am looking for a way to process a window multiple times, either with an initial wait period and processing of any later events after that, or a pure process-per-event structure. I will need to keep all previous events around (ListState), as well as previously processed values (ValueState).
I previously thought allowedLateness would let me do this, but it seems the lateness is only considered relative to when the event should have been processed; it does not extend the actual window. GlobalWindows may also work, but I am unsure whether there is a way to process a window multiple times. I believe I could use an evictor with GlobalWindows to purge the windows after a period of inactivity (although admittedly I have not researched this yet, because I was unsure how to trigger a GlobalWindow multiple times).
Any suggestions on how to achieve what I am looking to do would be greatly appreciated, I would also be happy to clarify any points needed.
If SessionWindows won't do the job, then you can use GlobalWindows with a custom Trigger and Evictor. The Trigger interface has onElement and timer-based callbacks that can fire whenever and as often as you like. If you go down this route, then yes, you'll also need to implement an Evictor to dispose of elements when they are no longer needed. A sketch of such an evictor follows.
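For instance, an Evictor that, after each firing, drops elements older than a configurable age. The StaleElementEvictor name and the maxAgeMs parameter are assumptions, and the sketch presumes event-time semantics:

    import org.apache.flink.streaming.api.windowing.evictors.Evictor
    import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
    import org.apache.flink.streaming.runtime.operators.windowing.TimestampedValue

    class StaleElementEvictor[T](maxAgeMs: Long) extends Evictor[T, GlobalWindow] {

      override def evictBefore(elements: java.lang.Iterable[TimestampedValue[T]],
          size: Int, window: GlobalWindow, ctx: Evictor.EvictorContext): Unit = ()

      override def evictAfter(elements: java.lang.Iterable[TimestampedValue[T]],
          size: Int, window: GlobalWindow, ctx: Evictor.EvictorContext): Unit = {
        // After the window function has run, drop everything older than maxAgeMs
        // relative to the current watermark, so inactive sessions empty out.
        val cutoff = ctx.getCurrentWatermark - maxAgeMs
        val it = elements.iterator()
        while (it.hasNext) {
          if (it.next().getTimestamp < cutoff) it.remove()
        }
      }
    }

Keep in mind that an evictor only runs when the trigger fires, so your trigger needs to fire periodically (e.g., from a processing-time timer) for inactive sessions to actually be cleaned up.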
The documentation and the source code are helpful when trying to understand how this all fits together.

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink, with Kafka as the message queue. I am applying an event-time sliding window of 120 seconds with a slide of 1 second. I am also inserting a watermark at each second of event time.
My concern is: what happens if an element arrives late, after the watermark? In my case, Flink simply discards messages that arrive after their respective watermark. Is there any mechanism provided by Flink to handle such late messages, like maintaining a separate window? I have gone through the documentation, but it did not become clear to me.
Apache Flink has a concept called allowed lateness for windows, to handle data that arrives after the watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows you to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped; its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
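In code, this is a single call on the windowed stream. A short sketch, matching the 120-second window and 1-second slide from the question (the Reading type and the 30-second lateness bound are assumptions):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Reading(id: String, value: Long)

    def withLateness(readings: DataStream[Reading]): DataStream[Reading] =
      readings
        .keyBy(_.id)
        .timeWindow(Time.seconds(120), Time.seconds(1))
        .allowedLateness(Time.seconds(30)) // late events within 30s re-fire the window
        .reduce((a, b) => Reading(a.id, a.value + b.value))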
Another option is a side output. In addition to the main stream that results from DataStream operations, you can also produce any number of additional side output result streams. The type of data in the side output streams does not have to match the type of data in the main stream, and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out, from each copy, the data you don't want.
When using side outputs, you first need to define an OutputTag that will be used to identify a side output stream:
https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html
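For the late-data case specifically, here is a sketch combining an OutputTag with the window operator's sideOutputLateData (the Reading type, from the previous sketch, and the tag name are assumptions):

    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.time.Time

    case class Reading(id: String, value: Long)

    val lateTag = OutputTag[Reading]("late-readings")

    def process(readings: DataStream[Reading]): Unit = {
      val summed = readings
        .keyBy(_.id)
        .timeWindow(Time.seconds(120), Time.seconds(1))
        .sideOutputLateData(lateTag) // events past watermark + lateness go here
        .reduce((a, b) => Reading(a.id, a.value + b.value))

      // The late events arrive on a separate stream, to handle as you see fit.
      val lateReadings: DataStream[Reading] = summed.getSideOutput(lateTag)
    }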
Note that allowed lateness can result in multiple outputs: the window fires once when the watermark passes the end of the window, and then, for each late element that arrives within the allowed lateness, it fires again with another updated aggregate.
