The example is very useful at first,it illustrates how keyedProcessFunction is working in Flink
there is something worth noticing, it suddenly came to me...
It is from Fraud Detector v2: State + Time part
It is reasonable to set a timer here, regarding the real application requirement part
override def onTimer(
timestamp: Long,
ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
out: Collector[Alert]): Unit = {
// remove flag after 1 minute
timerState.clear()
flagState.clear()
}
Here is the problem:
The TimeCharacteristic IS ProcessingTime which is determined by the system clock of the running machine, according to ProcessingTime property, the watermark will NOT be changed overtime, so that means onTimer will never be called, unless the TimeCharacteristic changes to eventTime
According the flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function be triggered? because the condition for a window to be triggered is when the watermark enters the end time of a window
I'm wondering the condition where the window is triggered or not doesn't depend on watermark in priocessingTime, even though the official website doesn't mention that at all, it will be based on the processing time to trigger the window
Hope someone can spend a little time on this,many thx!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
Watermarks are only relevant when doing event time processing, and only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in this DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly, or indirectly (by using event time windows, or through one of the higher level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to a match a pattern to a sequence of events within a specific timeframe -- so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (i.e., 10:00am to 10:01am).
Related
My Flink job has to compute a certain aggregation after each working shift. Shifts are configurable and look something like:
1st shift: 00:00am - 06:00am
2nd shift: 06:00am - 12:00pm
3rd shift: 12:00pm - 18:00pm
Shifts are the same every day for operational purposes, there is no distinction between days of the week/year. The shifts configuration can vary over time and can be non-monotonous, so this leaves out of the table a trivial EventTime window like:
TumblingEventTimeWindows.of(Time.of(6, HOURS)) as some of the shifts might be shrunk or spanned overtime, or a couple hours break in between might be inserted...
I have come up with something based on a GlobalWindow and a custom Trigger:
LinkedList<Shift> shifts;
datastream.windowAll(GlobalWindows.create())
.trigger(ShiftTrigger.create(shifts))
.aggregate(myAggregateFunction)
where in my custom trigger I attempt to discern if an incoming event passes the end time of the on-going working shift, and fire the window for the shift:
#Override
public TriggerResult onElement(T element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
// compute the end time of the on-going shift
final Instant currentShiftEnd = ...
// fire window for the shift if the event passes the end line
if (ShiftPredicate.of(currentShiftEnd).test(element)) {
return TriggerResult.FIRE_AND_PURGE;
}
return TriggerResult.CONTINUE;
}
Omitting the code for state management and some memoization optimizations, this seems to be working fine in a streaming use case: the first event coming in after a shift endtime, triggers the firing and the aggregation for the last shift.
However the job can be run bounded for date parameters (eg: for reprocessing past periods), or be shutdown prematurely for a set of expected reasons. When this sort of thing happens, I observe that the last window is not fired/flushed,
ie: the last shift of the day ends at midnight, and right over should
start the 1st shift of the next day. An event comes at 23:59pm and the
shift is about to end. However, the job is just running for the day of
today, and at 00:00 it finishes. Since no new element arrived to the
custom trigger passing the line to trigger the window firing, the
aggregation for the last shift is not calculated, however, some
partial results are still expected, even if nothing is happening in
the next shift or the job terminates in the middle of the on-going
shift.
I've read that the reason for this is:
Flink guarantees removal only for time-based windows and not for other
types, e.g. global windows (see Window Assigners)
I have taken a look inside the org.apache.flink.streaming.api.windowing package to look for something like a TumblingEventTimeWindows or DynamicEventTimeSessionWindows that I could use or extend with an end hour of the day, so that I can rely on the default event-time trigger of these firing when the watermark of the element passes the window limit, but I'm not sure how to do it. Intuitively I'd wish for something like:
shifts.forEach(shift -> {
datastream.windowAll(EventTimeWindow.fromTo(DAILY, shift.startTime, shift.endTime))
.aggregate(myAggregateFunction);
});
I know for use cases of arbitrary complexity, what some people do is ditching the Windows API in detriment of low-level process functions, where they "manually" compute the window by holding elements as managed state of the operator, while at given rules or conditions they fit and extract results from a defined aggregate function or accumulator. Also in a process function, is possible to pin point any pending calculations by tapping into the onClose hook.
Would there be a way to get this concept of recurrent event time windows for certain hours of a day every day by extending any of the objects in the Windows API?
If I understand correctly, there are two separate questions/issues here to resolve:
How to handle not having uniform window boundaries.
How to terminate the job without losing the results of the last window.
For (1), your approach of using GlobalWindows with a custom ShiftTrigger is one way to go. If you'd like to explore an alternative that uses a process function, I've written an example that you will find in the Flink docs.
For a more fluent API, you could create a custom WindowAssigner, which could then leverage the built-in EventTimeTrigger as its default trigger. To do this, you'll need to implement the WindowAssigner interface.
For (2), so long as you are relying on event time processing, the last set of windows won't be triggered unless a Watermark large enough to close them arrives before the job is terminated. This normally requires that you have an event whose timestamp is sufficiently after the window's end that a Watermark large enough to trigger the window is created (and that the job stays running long enough for that to happen).
However, when Flink is aware that a streaming job is coming to a natural end, it will automatically inject a Watermark with its timestamp set to MAX_WATERMARK, which has the effect of triggering all event time timers, and closing all event time windows. This happens automatically for any bounded sources. With Kafka (for example), you can also arrange for this by having your deserializer return true from isEndOfStream.
Another way to handle this is to avoid canceling such jobs when they are done, but to instead use ./bin/flink stop --drain [-p savepointPath] <jobID> to cleanly stop the job (with a savepoint), while draining all remaining window results (which it does by injecting one last big watermark (MAX_WATERMARK)).
I've implemented a Flink processor that aggregates events into sessions and then writes them to a sink. Now I'd like extend it so that I can get the number of concurrent sessions every five minutes.
The events coming into my system are on the form:
{
"SessionId": "UniqueUUID",
"Customer": "CustomerA",
"EventType": "EventTypeA",
[ ... ]
}
And a single session usually contains several events of different EventTypes. I then aggregate the events into sessions by doing the following in Flink.
DataStream<Session> sessions = events
.keyBy((KeySelector<HashMap, String>) event -> (String) event.get(Field.SESSION_ID))
.window(ProcessingTimeSessionWindows.withGap(org.apache.flink.streaming.api.windowing.time.Time.minutes(5)))
.trigger(SessionProcessingTimeTrigger.create())
.aggregate(new SessionAggregator())
Each session is the emitted (by the SessionProcessingTimeTrigger) when an event with a specific EventType is processed ("EventType":"Session.Ended"). And finally the stream is sent to a sink and written Kafka.
Now I want to write a similar Flink processor but instead of only emitting a session once it is finished, I instead want to emit all sessions every 5 minutes in order to keep track of how many concurrent session we have every 5 minutes.
So in a sense I guess what I want is a SessionWindow that also emits it's contents at regular intervals without purging the content.
I'm stumped on how to accomplish this in Flink and are therefore looking for some aid.
Whenever you want a Flink window to emit results at non-default times, you can do this by implementing a custom Trigger. You trigger just needs to return FIRE each time a 5-minute-long timer fires, in addition to its original logic. You'll want to register this timer when the first event is assigned to a window, and again every time the timer fires.
In the case of session windows this can be more complex because of the manner in which session windows are merged. But I believe that in the case of processing time session windows what I've outlined above will work.
My use-case is quite simple I receive events that contain "event timestamp", and want them to be aggregated based on event time. and the output is a periodical processing time tumbling window every 10min.
More specific, the stream of data that is keyed and need to compute counts for 7 seconds.
a tumbling window of 1 second
a sliding window for counting 7 seconds with an advance of 1 second
a windowall to output all counts every 1s
I am not able to integration test it (i.e., similar to unit test but an end-to-end testing) as the input has fake event time, which won't trigger
Here is my snippet
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val oneDayCounts = data
.map(t => (t.key1, t.key2, 1L, t.timestampMs))
.keyBy(0, 1)
.timeWindow(Time.seconds(1))
.sum(2)
val sevenDayCounts = oneDayCounts
.keyBy(0,1)
.timeWindow(Time.seconds(3), Time.seconds(1))
.sum(2)
// single reducer
sevenDayCounts
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
.process(...)
I use EventTime as timestamp and set up an integration test code with MiniClusterWithClientResource. also created some fake data with some event timestamp like 1234l, 4567l, etc.
EventTimeTrigger is able to be fired for sum computation but the following TumblingProcessingTimeWindow is not able to trigger. I had a Thread.sleep of 30s in the IT test code but still not triggered after the 30s
In general it's a challenge to write meaningful tests for processing time windows, since they are inherently non-deterministic. This is one reason why event time windows are generally prefered.
It's also going to be difficult to put a sleep in the right place so that is has the desired effect. But one way to keep the job running long enough for the processing time window to fire would be to use a custom source that includes a sleep. Flink streaming jobs with finite sources shut themselves down once the input has been exhausted. One final watermark with the value MAX_WATERMARK gets sent through the pipeline, which triggers all event time windows, but processing time windows are only fired if they are still running when the appointed time arrives.
See this answer for an example of a hack that works around this.
Alternatively, you might take a look at https://github.com/apache/flink/blob/master/flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/operators/windowing/TumblingProcessingTimeWindowsTest.java to see how processing time windows can be tested by mocking getCurrentProcessingTime.
Usecase: using EventTime and extracted timestamp from records from Kafka.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
.keyBy("platform")
.window(TumblingEventTimeWindows 5 mins))
.aggregate(AggFunc(), WindowFunc())
.countWindowAll(size)
.apply(someFunc)
.addSink(someSink);
What I want: Flink extracts timestamp and emits watermark for each record for an initial interval (e.g. 20 seconds), then it can periodically emits watermark (e.g. each 10s).
Reason: If I used PeriodicWatermark, at the beginning Flink will emit watermark only after some interval and the count in my 1st window of 5 mins is wrong - much larger than the count in the subsequent windows. I had a workaround setting setAutoWatermarkInterval to 100ms but this is more than necessary.
Currently, I must use AssignerWithPeriodicWatermark or AssignerWithPunctuatedWatermark. How can i implement this approach of a combining strategy? Thanks.
Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-flink) state in the assigner to keep track of either the time of the first event, or to count events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Your application will always revert back to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of periodic and punctuated assigners, it's simply an adaptive punctuated assigner.
Will be helpful if someone give usecase example to explain the difference between each of the Watermark API with Apache flink given below
Periodic watermarks - AssignerWithPeriodicWatermarks[T]
Punctuated Watermarks - AssignerWithPunctuatedWatermarks[T]
The main difference between the two types of watermark is how/when the getWatermark method is called.
periodic watermark
With periodic watermarks, Flink calls getCurrentWatermark() at regular interval, independently of the stream of events. This interval is defined using
ExecutionConfig.setAutoWatermarkInterval(millis)
Use this class when your watermarks depend (even partially) on the processing time, or when you need watermarks to be emitted even when no event/elements has been received for a while.
punctuated watermarks
With punctuated watermarks, Flink calls checkAndGetWatermark() on each new event, i.e. right after calling assignWatermark(). An actual watermark is emitted only if checkAndGetWatermark returns a non-null value which is greater than the last watermark.
This means that if you don't receive any new element for a while, no watermark can be emitted.
Use this class if certain special elements act as markers that signify event time progress, and when you want to emit watermarks specifically at certain events. For example, you could have flags in your incoming stream marking the end of a sequence.