Emitting the results of a session window every X minutes - apache-flink

I've implemented a Flink processor that aggregates events into sessions and then writes them to a sink. Now I'd like to extend it so that I can get the number of concurrent sessions every five minutes.
The events coming into my system are of the form:
{
"SessionId": "UniqueUUID",
"Customer": "CustomerA",
"EventType": "EventTypeA",
[ ... ]
}
And a single session usually contains several events of different EventTypes. I then aggregate the events into sessions by doing the following in Flink.
DataStream<Session> sessions = events
    .keyBy((KeySelector<HashMap, String>) event -> (String) event.get(Field.SESSION_ID))
    .window(ProcessingTimeSessionWindows.withGap(org.apache.flink.streaming.api.windowing.time.Time.minutes(5)))
    .trigger(SessionProcessingTimeTrigger.create())
    .aggregate(new SessionAggregator());
Each session is then emitted (by the SessionProcessingTimeTrigger) when an event with a specific EventType is processed ("EventType":"Session.Ended"). Finally, the stream is sent to a sink and written to Kafka.
Now I want to write a similar Flink processor, but instead of only emitting a session once it is finished, I want to emit all sessions every 5 minutes in order to keep track of how many concurrent sessions we have every 5 minutes.
So in a sense I guess what I want is a SessionWindow that also emits its contents at regular intervals without purging the content.
I'm stumped on how to accomplish this in Flink and am therefore looking for some aid.

Whenever you want a Flink window to emit results at non-default times, you can do this by implementing a custom Trigger. Your trigger just needs to return FIRE each time a 5-minute-long timer fires, in addition to its original logic. You'll want to register this timer when the first event is assigned to a window, and again every time the timer fires.
In the case of session windows this can be more complex because of the manner in which session windows are merged. But I believe that in the case of processing time session windows what I've outlined above will work.
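The timer logic described above can be sketched as a small plain-Java model. It deliberately does not use Flink's Trigger API: the class PeriodicFireModel, its Result enum, and its method names are hypothetical stand-ins, with Result mirroring the CONTINUE / FIRE / FIRE_AND_PURGE outcomes a trigger can return. It assumes that the original trigger purges the window on "Session.Ended", and it ignores session-window merging, which a real trigger would also have to handle.

```java
import java.util.OptionalLong;

/** Plain-Java model of the trigger logic; not Flink API (all names hypothetical). */
final class PeriodicFireModel {
    enum Result { CONTINUE, FIRE, FIRE_AND_PURGE }

    private final long intervalMs;
    private OptionalLong nextTimer = OptionalLong.empty();

    PeriodicFireModel(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Called per event assigned to the window; 'now' is the processing time in ms. */
    Result onElement(String eventType, long now) {
        if ("Session.Ended".equals(eventType)) {
            nextTimer = OptionalLong.empty();   // session is over: drop the periodic timer
            return Result.FIRE_AND_PURGE;       // assumed original behaviour: emit and purge
        }
        if (!nextTimer.isPresent()) {
            nextTimer = OptionalLong.of(now + intervalMs); // first event: register the timer
        }
        return Result.CONTINUE;
    }

    /** Called when the processing-time clock reaches a registered timer. */
    Result onTimer(long time) {
        if (!nextTimer.isPresent() || nextTimer.getAsLong() != time) {
            return Result.CONTINUE;             // stale timer, nothing to do
        }
        nextTimer = OptionalLong.of(time + intervalMs);    // re-register for the next interval
        return Result.FIRE;                     // emit current contents, keep the window state
    }

    OptionalLong nextTimer() {
        return nextTimer;
    }
}
```

In a real Trigger the timer would be registered through the trigger context and the "next timer" value would have to live in trigger state rather than in a field, since one trigger instance serves many windows.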

Related

Some questions related to Fraud detection demo from Flink DataStream API

The example is very useful; it illustrates how KeyedProcessFunction works in Flink.
However, there is something worth noticing that just occurred to me.
It comes from the "Fraud Detector v2: State + Time" part.
Setting a timer here is reasonable, given the requirements of the real application:
override def onTimer(
    timestamp: Long,
    ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
    out: Collector[Alert]): Unit = {
  // remove flag after 1 minute
  timerState.clear()
  flagState.clear()
}
Here is the problem:
The TimeCharacteristic is ProcessingTime, which is determined by the system clock of the machine running the job. As I understand processing time, the watermark does not advance over time, so onTimer would never be called unless the TimeCharacteristic were changed to EventTime.
According to the Flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function ever be triggered? As I understand it, a window is triggered when the watermark passes the window's end time.
I'm wondering whether, with ProcessingTime, the triggering of a window doesn't depend on the watermark at all and is instead based on the processing time, even though the official website doesn't mention that.
I hope someone can spend a little time on this. Many thanks!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
Watermarks are only relevant when doing event time processing, and the only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in this DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly or indirectly (by using event time windows, or through one of the higher level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
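To make the distinction concrete, here is a minimal plain-Java model of the two timer kinds. It is a sketch rather than Flink code: the class TimerModel and its methods are hypothetical, and the only behaviour it encodes is the rule stated above, namely that event time timers are driven by watermarks and processing time timers by the system clock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Plain-Java model of the two timer kinds; not Flink API (names hypothetical). */
final class TimerModel {
    private final TreeSet<Long> eventTimeTimers = new TreeSet<>();
    private final TreeSet<Long> processingTimeTimers = new TreeSet<>();

    void registerEventTimeTimer(long ts) {
        eventTimeTimers.add(ts);
    }

    void registerProcessingTimeTimer(long ts) {
        processingTimeTimers.add(ts);
    }

    /** A watermark fires every event time timer whose timestamp is <= the watermark. */
    List<Long> advanceWatermark(long watermark) {
        return drain(eventTimeTimers, watermark);
    }

    /** The system clock fires every processing time timer whose timestamp is <= now. */
    List<Long> advanceClock(long now) {
        return drain(processingTimeTimers, now);
    }

    // Removes and returns all timers up to and including 'upTo', in ascending order.
    private static List<Long> drain(TreeSet<Long> timers, long upTo) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.first() <= upTo) {
            fired.add(timers.pollFirst());
        }
        return fired;
    }
}
```

In this model, a processing time timer registered for 60_000 fires on advanceClock(60_000) even if advanceWatermark is never called, which corresponds to onTimer being invoked in the walkthrough without any watermarks.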
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to match a pattern to a sequence of events within a specific timeframe -- so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (i.e., 10:00am to 10:01am).

Using Broadcast State To Force Window Closure Using Fake Messages

Description:
Currently I am working on using Flink with an IoT setup. Essentially, devices are sending data such as (device_id, device_type, event_timestamp, etc.) and I don't have any control over when the messages get sent. I then key the stream by device_id and device_type to perform aggregations. I would like to use event time, given that it ensures the timers that are set trigger deterministically in the event of a failure. However, given that this isn't always a high-throughput stream, a window could be opened for a 10 minute aggregation period but not receive its next point until approximately 40 minutes later. Although the aggregation would eventually be completed, it would output my desired result extremely late.
So my workaround for this is to create an additional external source that does nothing other than pump fake messages. With these fake messages being pumped out in alignment with my 10 minute aggregation period, even if a device hadn't sent any data, the event time windows would have something to force them closed. The critical part here is to make sure that all parallel instances / operators have access to this fake message, because I need to close all the windows with this single fake message. I was thinking that broadcast state might be the most appropriate way to accomplish this goal given: "Broadcast state is replicated across all parallel instances of a function, and might typically be used where you have two streams, a regular data stream alongside a control stream that serves rules, patterns, or other configuration messages."
Questions:
Is broadcast state the best method for ensuring all parallel instances (e.g. windows) receive my fake messages?
Once the operators have access to this fake message via the broadcast state can this fake message then be used to advance the event time watermark?
You can make this work with broadcast state, along the lines you propose, but I'm not convinced it's the best solution.
In an ideal world I'd suggest you arrange for the devices to send occasional keepalive messages, but assuming that's not possible, I think a custom Trigger would work well here. You can extend the EventTimeTrigger so that in addition to the event time timer it creates via
ctx.registerEventTimeTimer(window.maxTimestamp());
you also create a processing time timer, as a fallback, and you FIRE the window if the window still exists when that processing time timer fires.
I'm recommending this approach because it's simpler and more directly addresses the specific need. With the broadcast state approach you'll have to introduce a source for these messages, add a broadcast state descriptor and stream, add special fake watermarks for the non-broadcast stream (set to Watermark.MAX_WATERMARK), connect the broadcast and non-broadcast streams and implement a BroadcastProcessFunction (that probably doesn't really do anything), etc. It's a lot of moving parts spread across several different operators.

TumblingProcessingTimeWindows processing with event time characteristic is not triggered

My use case is quite simple: I receive events that contain an event timestamp and want them to be aggregated based on event time, and the output is a periodic processing time tumbling window every 10 minutes.
More specifically, the stream of data is keyed, and I need to compute counts for 7 seconds:
a tumbling window of 1 second
a sliding window for counting 7 seconds with an advance of 1 second
a windowAll to output all counts every 1s
I am not able to integration test it (i.e., similar to a unit test, but end to end), as the input has fake event times, which won't trigger the final window.
Here is my snippet
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val oneDayCounts = data
  .map(t => (t.key1, t.key2, 1L, t.timestampMs))
  .keyBy(0, 1)
  .timeWindow(Time.seconds(1))
  .sum(2)
val sevenDayCounts = oneDayCounts
  .keyBy(0, 1)
  .timeWindow(Time.seconds(3), Time.seconds(1))
  .sum(2)
// single reducer
sevenDayCounts
  .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
  .process(...)
I use EventTime as the time characteristic and set up the integration test with MiniClusterWithClientResource. I also created some fake data with event timestamps like 1234l, 4567l, etc.
The EventTimeTrigger does fire for the sum computations, but the following TumblingProcessingTimeWindow does not trigger. I had a Thread.sleep of 30s in the integration test code, but it still had not triggered after the 30s.
In general it's a challenge to write meaningful tests for processing time windows, since they are inherently non-deterministic. This is one reason why event time windows are generally preferred.
It's also going to be difficult to put a sleep in the right place so that it has the desired effect. But one way to keep the job running long enough for the processing time window to fire would be to use a custom source that includes a sleep. Flink streaming jobs with finite sources shut themselves down once the input has been exhausted. One final watermark with the value MAX_WATERMARK gets sent through the pipeline, which triggers all event time windows, but processing time windows are only fired if they are still running when the appointed time arrives.
See this answer for an example of a hack that works around this.
Alternatively, you might take a look at https://github.com/apache/flink/blob/master/flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/operators/windowing/TumblingProcessingTimeWindowsTest.java to see how processing time windows can be tested by mocking getCurrentProcessingTime.

Apache Flink - how to skip all but most recent window on startup

In Flink, I have a Job with a Keyed Stream of events (e.g.: 10 events for each Key every day on average).
They are handled as Sliding Windows based on Event-Time (e.g.: 90-days window size and 1-day window slide).
Events are consumed from Kafka, which persists all event history (e.g.: last 3 years).
Sometimes I'd like to restart Flink: for maintenance, bug handling, etc.
Or start a fresh Flink instance with Kafka already containing event history.
In such a case I would like to skip triggering for all but the most recent window for each Key.
(It's specific to my use-case: each window, when processed, effectively overrides processing results from previous windows. So at startup, I would like to process only the single most recent window for each Key.)
Is it possible in Flink? If so, then how to do it?
You can use
FlinkKafkaConsumer<T> myConsumer = new FlinkKafkaConsumer<>(...);
myConsumer.setStartFromTimestamp(...); // start from specified epoch timestamp (milliseconds)
which is described along with other related functions in the section of the docs on Kafka Consumers Start Position Configuration.
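The argument is an epoch timestamp in milliseconds. As a rough sketch of how such a value might be computed for the 90-day window from the question, using only java.time: StartTimestamp and startFrom are hypothetical names, and starting exactly one window size before "now" is an assumption about what the job needs, not something the connector prescribes.

```java
import java.time.Clock;
import java.time.Duration;
import java.time.Instant;

/** Computes a start timestamp for the Kafka consumer (helper names are hypothetical). */
final class StartTimestamp {
    /** Epoch milliseconds lying one window size before the clock's current instant. */
    static long startFrom(Clock clock, Duration windowSize) {
        return Instant.now(clock).minus(windowSize).toEpochMilli();
    }

    public static void main(String[] args) {
        long startMs = startFrom(Clock.systemUTC(), Duration.ofDays(90));
        // This value would then be passed as myConsumer.setStartFromTimestamp(startMs).
        System.out.println(startMs);
    }
}
```

Passing the Clock in as a parameter keeps the calculation testable with a fixed clock. Note that windows which began before the chosen start point would only see part of their events.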
Or you could use a savepoint to do a clean upgrade/redeploy without losing your Kafka offsets and associated window contents.

Custom Windows charging in Flink

I am using Flink's TimeWindow functionality to perform some computations. I am creating a 5 minute window. However, I want the very first window to be one hour long. The windows after that should be 5 minutes long.
That is, for the first hour, data is collected and my operation is performed on it. Once this is done, the same operation is performed every five minutes.
I figure this can be implemented with a trigger, but I am not sure which trigger I should use and how.
UPDATE: I don't think even triggers are helpful, from what I can get, they just define the time/count triggering per window, not when the first window is to be triggered.
This is not trivial to implement.
Given a KeyedStream, you have to use a GlobalWindow and a custom stateful Trigger which "remembers" whether it has fired for the first time or not.
val stream: DataStream[(String, Int)] = ???
val result = stream
  .keyBy(0)
  .window(GlobalWindows.create())
  .trigger(new YourTrigger())
  .apply(new YourWindowFunction())
Details about GlobalWindow and Trigger are in the Flink Window documentation.
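The "remembering" part of such a trigger comes down to one flag. The following plain-Java sketch models only that decision, using the one-hour and five-minute delays from the question. It is not Flink code: the class FirstThenPeriodicSchedule and its methods are hypothetical, and in a real Trigger the flag and the next firing time would be kept in trigger state, with timers registered through the trigger context.

```java
/** Plain-Java model of a stateful trigger schedule; not Flink API (names hypothetical). */
final class FirstThenPeriodicSchedule {
    private final long firstDelayMs;
    private final long intervalMs;
    private boolean firedOnce = false;   // the state the trigger has to remember
    private long nextFire = -1L;         // -1 means no timer has been registered yet

    FirstThenPeriodicSchedule(long firstDelayMs, long intervalMs) {
        this.firstDelayMs = firstDelayMs;
        this.intervalMs = intervalMs;
    }

    // Long delay before the first firing, short interval afterwards.
    private long delay() {
        return firedOnce ? intervalMs : firstDelayMs;
    }

    /** Called for every element; registers the initial timer only once. */
    void onElement(long now) {
        if (nextFire < 0) {
            nextFire = now + delay();
        }
    }

    /** Called when a timer is reached; returns true if the window should fire. */
    boolean onTimer(long time) {
        if (time != nextFire) {
            return false;                // not the timer we are waiting for
        }
        firedOnce = true;
        nextFire = time + delay();       // schedule the next, shorter interval
        return true;
    }

    boolean hasFiredOnce() {
        return firedOnce;
    }

    long nextFire() {
        return nextFire;
    }
}
```

Whether each firing should also purge the collected elements depends on whether the later five-minute results are meant to include the first hour's data, which the question leaves open.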

Resources