How to debug multiple trigger events on keyed window - apache-flink

My DataStream is derived from a custom SourceFunction, which emits string-sequences of WINDOW size, in a deterministic sequence.
The aim is to create sliding windows over the keyed stream, based on EventTime, for processing the accumulated strings.
To assign EventTime and watermarks, I attach an AssignerWithPeriodicWatermarks to the stream.
The sliding window is processed with a custom ProcessWindowFunction.
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val seqStream = env.addSource(Seqstream)
    .assignTimestampsAndWatermarks(SeqTimeStampExtractor())
    .keyBy(getEventtimeKey)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(windowSize), Time.milliseconds(slideSize)))
val result = seqStream.process(ProcessSeqWindow(target1))
My AssignerWithPeriodicWatermarks looks like this:
class FASTATimeStampExtractor : AssignerWithPeriodicWatermarks<FASTAstring> {
    var waterMark = 9999L

    override fun extractTimestamp(element: FASTAstring, previousElementTimestamp: Long): Long {
        return element.f1
    }

    override fun getCurrentWatermark(): Watermark? {
        waterMark += 1
        return Watermark(waterMark)
    }
}
In other words, each element emitted by the source should have its own event time, and a watermark should then be emitted, signalling that no further events will arrive for that time.
Stepping through the stream in a debugger indicates that event times and watermarks are generated as would be expected.
My expectation is that ProcessSeqWindow.run() ought to be called with a number of elements proportional to the window size (e.g. 10 ms) in event time. However, what I observe is that run() is called multiple times with single elements, in an arbitrary order with respect to event time.
The behaviour persists when I force parallelism to 1.
My question is whether this is likely to be caused by multiple trigger events on each window, or are there other possible explanations? And how can I debug the cause?
Thanks

The role of the watermarks in your job is to trigger the closing of the sliding event-time windows. To play that role properly, they should be based on the timestamps in the events, rather than on an arbitrary counter starting at 9999L. The reason the same object is responsible for both extracting timestamps and supplying watermarks is so that it can base the watermarks it creates on its observations of the timestamps in the event stream. So unless your event timestamps are also based on incrementing a similar counter, this may explain some of the behavior you are seeing.

Another issue is that while extractTimestamp is called for every event, in a periodic watermark assigner the getCurrentWatermark method is called in a separate thread once every 200 msec (by default). If you want a watermark after every event you'll need an AssignerWithPunctuatedWatermarks, though doing so is something of an anti-pattern (having that many watermarks adds overhead).

If your timestamps are completely artificial, you might find a sliding count window (countWindow) a more natural fit for what you're doing.
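For illustration, here is a minimal sketch of a periodic assigner whose watermarks follow the largest timestamp seen so far, minus a small out-of-orderness allowance. It uses the same pre-1.11 AssignerWithPeriodicWatermarks interface as the question, but the Tuple2<String, Long> event type (timestamp in f1) and the 10 ms bound are stand-ins, not the original code:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Watermarks derived from observed event timestamps rather than an unrelated counter.
public class BoundedLatenessAssigner
        implements AssignerWithPeriodicWatermarks<Tuple2<String, Long>> {

    private static final long MAX_OUT_OF_ORDERNESS_MS = 10L; // an assumed bound
    private long maxSeenTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS_MS;

    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long previousElementTimestamp) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, element.f1);
        return element.f1;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // declares that no events at or below this timestamp are still expected
        return new Watermark(maxSeenTimestamp - MAX_OUT_OF_ORDERNESS_MS);
    }
}

With an assigner like this, the watermarks advance in step with the event timestamps, so windows close in event-time order rather than according to an unrelated counter.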

Related

How do I find the event time difference between consecutive events in Flink?

I want to find the event time difference between every two consecutive input events. If the time difference is above a certain threshold then I want to output an event signalling the threshold has been breached. I also want the first event of the stream to always output this breach signal as an indication that it does not have a previous event to calculate a time difference with.
I tried using Flink's CEP library as it ensures that the events are ordered by event time.
The pattern I created is as follows:
Pattern.begin("begin").optional().next("end");
I use the optional() clause to cater for the first event as I figured the first event would be the only event where "begin" would not have a value.
When I input my events a1 a2 a3 a4 a5 I get the following output matches:
{a1} {a1 a2} {a2} {a2 a3} {a3} {a3 a4} {a4} {a4 a5}...
However I want the following as it will allow me to calculate the time difference between each consecutive event.
{a1} {a1 a2} {a2 a3} {a3 a4} {a4 a5}...
I have tried playing around with different AfterMatchSkipStrategy settings as well as IterativeCondition clauses but with no success.
Marking "begin" as optional is what's causing the unwanted matches. I would look for some other way to generate the breach signal for the first event -- e.g., perhaps you could prepend a dummy first event.
Another approach would be to only use CEP or SQL for sorting the stream, and then use a RichFlatMap or stateful process function to implement the business logic: i.e., compute the differences and generate the breach signals.
See Can I use Flink CEP to sort a stream? for how to do this.
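For example, once the stream is sorted and keyed, a stateful flatmap along these lines could compute the differences and breach signals. This is a sketch, not the asker's code: the Tuple2<String, Long> event type (key in f0, event-time timestamp in f1) and the threshold are assumptions:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits a breach signal for the first event of each key, and again whenever the
// gap between consecutive (time-sorted) events exceeds the threshold.
public class GapDetector extends RichFlatMapFunction<Tuple2<String, Long>, String> {
    private static final long THRESHOLD_MS = 60_000L; // an assumed threshold
    private transient ValueState<Long> lastTimestamp;

    @Override
    public void open(Configuration parameters) {
        lastTimestamp = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimestamp", Long.class));
    }

    @Override
    public void flatMap(Tuple2<String, Long> event, Collector<String> out) throws Exception {
        Long previous = lastTimestamp.value();
        if (previous == null || event.f1 - previous > THRESHOLD_MS) {
            // first event for this key, or the threshold was breached
            out.collect("breach at " + event.f1);
        }
        lastTimestamp.update(event.f1);
    }
}

It would be applied to the sorted stream as sortedStream.keyBy(e -> e.f0).flatMap(new GapDetector()).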

Flink - processing consecutive events within time constraint

I have a use case and I think I need some help on how to approach it.
Because I am new to streaming and Flink, I will try to be very descriptive in what I am trying to achieve. Sorry if my language is not formal and correct.
My code will be in Java, but I am also happy with code in Python, pseudocode, or just an outline of the approach.
TL;DR
Group events of the same key that are within some time limit.
Out of those events, create a result event only from the two closest (in the time domain) events.
This requires (I think) opening a window for each and every event that comes in.
If you look ahead at the batch solution below, you will best understand my problem.
Background:
I have data coming from sensors as a stream from Kafka.
I need to use event time because the data arrives out of order. A lateness of about 1 minute covers 90% of the events.
I am grouping those events by some key.
What I want to do:
Depending on some of an event's fields, I would like to "join/mix" 2 events into a new event ("result event").
The first condition is that the consecutive events are WITHIN 30 seconds of each other.
The next conditions simply check some field values and then decide.
My pseudo-solution:
Open a new window for EACH event. That window should be 1 minute long.
For every event that arrives within that minute, I want to check its event time and see whether it is within 30 seconds of the initial window event. If yes, check the other conditions and emit a new result event.
The problem: when a new event comes, it needs to:
create a new window for itself.
join only ONE window out of the SEVERAL possible windows that are within 30 seconds of it.
The question:
Is that possible?
In other words, my connection is between two "consecutive" events only.
Thank you very much.
Maybe showing the solution for the BATCH case will best show what I am trying to do:
for i in range(len(grouped_events) - 1):
    event_A = grouped_events[i]
    event_B = grouped_events[i + 1]
    if event_B.get("time") - event_A.get("time") < 30:
        if event_B.get("color") == event_A.get("color"):
            if event_B.get("size") > event_A.get("size"):
                create_result_event(event_A, event_B)
My (naive) attempts so far with Flink in Java
(The sum function is just a placeholder for my function that creates a new result object.)
The first solution just uses a simple time window and sums by some field.
The second tries a process function on the window, where maybe I could iterate through all the events and check my conditions?
DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .sum("size")
    .print();

DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new processFunction());
public static class processFunction extends ProcessWindowFunction<ThreeEvent, Tuple3<Long, Long, Float>, Long, TimeWindow> {
    @Override
    public void process(Long key, Context context, Iterable<ThreeEvent> threeEvents, Collector<Tuple3<Long, Long, Float>> out) throws Exception {
        Float sumOfSize = 0F;
        for (ThreeEvent f : threeEvents) {
            sumOfSize += f.getSize();
        }
        out.collect(new Tuple3<>(context.window().getEnd(), key, sumOfSize));
    }
}
You can, of course, use windows to create mini-batches that you sort and analyze, but it will be difficult to handle the window boundaries correctly (what if the events that should be paired land in different windows?).
This looks like it would be much more easily done with a keyed stream and a stateful flatmap. Just use a RichFlatMapFunction and use one piece of keyed state (a ValueState) that remembers the previous event for each key. Then as each event is processed, compare it to the saved event, produce a result if that should happen, and update the state.
You can read about working with Flink's keyed state in the Flink training and in the Flink documentation.
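As a sketch of that idea, here is a keyed, stateful version of the batch loop above. The getTime(), getColor(), and getSize() accessors, millisecond timestamps, and the ResultEvent output type are assumptions based on the question's pseudocode, and the stream is assumed to arrive in timestamp order per key:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Remembers the previous event per key, and pairs it with the current one when
// the time, color, and size conditions from the batch pseudocode hold.
public class PairMatcher extends RichFlatMapFunction<ThreeEvent, ResultEvent> {
    private transient ValueState<ThreeEvent> previous;

    @Override
    public void open(Configuration parameters) {
        previous = getRuntimeContext().getState(
                new ValueStateDescriptor<>("previousEvent", ThreeEvent.class));
    }

    @Override
    public void flatMap(ThreeEvent current, Collector<ResultEvent> out) throws Exception {
        ThreeEvent prev = previous.value();
        if (prev != null
                && current.getTime() - prev.getTime() < 30_000 // 30 s, assuming ms timestamps
                && current.getColor().equals(prev.getColor())
                && current.getSize() > prev.getSize()) {
            out.collect(new ResultEvent(prev, current)); // hypothetical result type
        }
        previous.update(current); // the current event becomes the previous one
    }
}

It would be applied as stream.keyBy(ThreeEvent::getUserId).flatMap(new PairMatcher()).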
The one thing that concerns me about your use case is whether or not your events may arrive out-of-order. Is it the case that to get correct results you would need to first sort the events by timestamp? That isn't trivial. If this is a concern, then I would suggest that you use Flink SQL with MATCH_RECOGNIZE, or the CEP library, both of which are designed for doing pattern recognition on event streams, and will take care of sorting the stream for you (you just have to provide timestamps and watermarks).
This query may not be exactly right, but hopefully conveys the flavor of how to do something like this with match recognize:
SELECT * FROM Events
MATCH_RECOGNIZE (
    PARTITION BY userId
    ORDER BY eventTime
    MEASURES
        A.userId AS userId,
        A.color AS color,
        A.size AS aSize,
        B.size AS bSize
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B)
    DEFINE
        A AS true,
        B AS TIMESTAMPDIFF(SECOND, A.eventTime, B.eventTime) < 30
            AND A.color = B.color
            AND A.size < B.size
);
This can also be done quite naturally with CEP, where the basis for comparing consecutive events is to use an iterative condition, and you can use a within clause to handle the time constraint.
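A sketch of the CEP version, with the same caveat that it may not be exactly right; ThreeEvent and its accessors are assumed as above:

import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.IterativeCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// B is accepted only if it matches the preceding A on color, has a larger size,
// and arrives within 30 seconds (enforced by the within clause).
Pattern<ThreeEvent, ?> pattern = Pattern.<ThreeEvent>begin("A")
        .next("B")
        .where(new IterativeCondition<ThreeEvent>() {
            @Override
            public boolean filter(ThreeEvent b, Context<ThreeEvent> ctx) throws Exception {
                // look up the event already matched as "A"
                ThreeEvent a = ctx.getEventsForPattern("A").iterator().next();
                return b.getColor().equals(a.getColor()) && b.getSize() > a.getSize();
            }
        })
        .within(Time.seconds(30));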

Flink: Append an event to the end of finite DataStream

Assuming there is a finite DataStream (from a database source, for example) with events
a1, a2, ..., an.
How to append one more event b to this stream to get
a1, a2, ..., an, b
(i.e. output the added event after all original events, preserving the original ordering)?
I know that all finite streams emit the MAX_WATERMARK after all events. So, is there a way to "catch" this watermark and output the additional event after it?
(Unfortunately, union()ing the original DataStream with another DataStream consisting of a single event (with its timestamp set to Long.MAX_VALUE) and then sorting the combined stream using this answer did not work.)
Maybe I'm missing something, but it seems like you could simply have a ProcessFunction with an event time timer set for somewhere in the distant future, so that it only fires when the MAX_WATERMARK arrives. And then in the onTimer method, emit that special event if the currentWatermark is MAX_WATERMARK.
Another approach might be to 'wrap' the original data source in another data source, which emits a final element when the delegate object's run() method returns. You'd need to be careful to call through to all of the delegate methods, of course.
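A minimal sketch of the timer idea, assuming a String element type and that the stream has first been keyed by a constant (timers are only available in a keyed context); the appended event "b" is a placeholder:

import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Forwards every element and registers one event-time timer at Long.MAX_VALUE.
// On a finite stream, the final MAX_WATERMARK advances the event-time clock to
// Long.MAX_VALUE, so the timer fires exactly once, after all elements.
public class AppendOnEnd extends KeyedProcessFunction<Integer, String, String> {

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        // repeated registrations for the same timestamp and key are deduplicated
        ctx.timerService().registerEventTimeTimer(Long.MAX_VALUE);
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        out.collect("b"); // the extra event, emitted after the last original element
    }
}

Usage would be along the lines of stream.keyBy(e -> 0).process(new AppendOnEnd()).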

How to have a true sliding window that ignores recent events?

I was trying to build something like a window that behaves like a sliding window and:
Counts events, ignoring those that fall within a certain "delay" before the end of the window
Triggers once and only once per event
Output count of events in [event TS - delay - duration , event TS - delay]
Using pre-aggregation to avoid saving all the events.
The parameters of the window would be:
Duration: duration of the window
Output: offset of the events to trigger, counting from the end of the window. Analogous to "slide".
Delay: offset of the events to ignore, counting from the end of the window. Essentially ignore events such that timestamp <= end of window - slide - delay.
The idea I was trying involved having a sliding window with:
Duration: duration + output + delay
Slide: output
Trigger whenever the event TS is in [window end - output, window end]. This causes only one window to trigger.
The question now is: how to filter events in order to ignore the ones before "delay"? I've thought of:
Having an aggregator that only sums the value if the event TS is between the correct bounds. This is not possible because window aggregators can't be RichAggregateFunctions, and therefore I have no access to the window metadata. Is this assumption correct?
Having pre-aggregation with:
Typical sum reducer
RichWindowFunction that uses managed state to keep track of how many elements were seen in the "area to ignore" and subtract that from the aggregator result received. The problem is that getRuntimeContext().getState() is not maintained per window and therefore can't be used. Is this assumption correct?
Are there any alternatives I'm missing or is any of the assumptions incorrect?
I may have gotten a bit lost in the details, but maybe I see a solution.
Seems like you could use a custom Trigger that fires twice, before and after the delay. Then use a ProcessWindowFunction with incremental aggregation, and use per-window state to hold the count of the first firing (and then subtract later).
Given the complexity in putting that all together, a solution based on a ProcessFunction and managed state might be simpler.
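To make the per-window state mechanism concrete, here is a heavily simplified sketch. It assumes an upstream incremental aggregation producing a single Long count, a custom trigger (not shown) that fires each window exactly twice, and one possible reading of the correction; it is meant only to demonstrate ctx.windowState():

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// First firing: remember the pre-aggregated count. Second firing: emit the
// remembered count, i.e. the events counted before the delay cutoff.
public class TwoFiringCount extends ProcessWindowFunction<Long, Long, String, TimeWindow> {
    private final ValueStateDescriptor<Long> firstFiringDesc =
            new ValueStateDescriptor<>("firstFiringCount", Long.class);

    @Override
    public void process(String key, Context ctx, Iterable<Long> counts, Collector<Long> out)
            throws Exception {
        long latestCount = counts.iterator().next();
        // per-window state is scoped to this key AND this window instance,
        // unlike getRuntimeContext().getState()
        ValueState<Long> firstFiring = ctx.windowState().getState(firstFiringDesc);
        if (firstFiring.value() == null) {
            firstFiring.update(latestCount); // first firing: store, emit nothing
        } else {
            out.collect(firstFiring.value()); // second firing: emit the early count
        }
    }

    @Override
    public void clear(Context ctx) throws Exception {
        ctx.windowState().getState(firstFiringDesc).clear(); // tidy up per-window state
    }
}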

Flink trigger on a custom window

I'm trying to evaluate Apache Flink for the use case we're currently running in production using custom code.
So let's say there's a stream of events, each containing a specific attribute X which is a continuously increasing integer. That is, a bunch of contiguous events have this attribute set to N, then the next batch has it set to N+1, etc.
I want to break the stream into windows of events with the same value of X and then do some computations on each separately.
So I define a GlobalWindow and a custom Trigger where in onElement method I check the attribute of any given element against the saved value of the current X (from state variable) and if they differ I conclude that we've accumulated all the events with X=CURRENT and it's time to do computation and increase the X value in the state.
The problem with this approach is that the element from the next logical batch (with X=CURRENT+1) has been already consumed but it's not a part of the previous batch.
Is there a way to put it back somehow into the stream so that it is properly accounted for the next batch?
Or maybe my approach is entirely wrong and there's an easier way to achieve what I need?
Thank you.
I think you are on the right track.
A Trigger specifies when a window is ready to be processed and its results emitted.
The WindowAssigner is the part that decides which window an element is assigned to. So I would say you also need to provide a custom WindowAssigner implementation that assigns the same window to all elements with an equal value of X.
A more idiomatic way to do this with Flink would be to use stream.keyBy(X).window(...). The keyBy(X) takes care of grouping elements by their particular value for X. You then apply any sort of window you like. In your case a SessionWindow may be a good choice. It will fire for each key after that key hasn't been seen for some configurable period of time.
This approach will be much more robust with regard to unordered data which you must always assume in a stream processing system.
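A minimal sketch of the session-window approach, under assumptions: a stand-in Event POJO carrying the increasing attribute x, ascending timestamps, a 5-second gap, and a placeholder reduce in place of the real computation:

import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SessionPerX {
    // stand-in for the real event type; x is the continuously increasing attribute
    public static class Event {
        public int x;
        public long ts;
        public Event() {}
        public Event(int x, long ts) { this.x = x; this.ts = ts; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.fromElements(new Event(1, 100L), new Event(1, 200L), new Event(2, 10_000L))
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Event>() {
                    @Override
                    public long extractAscendingTimestamp(Event e) { return e.ts; }
                })
                .keyBy(e -> e.x) // group by the attribute X
                // a key's window fires once no event with that X has been seen for 5 s
                .window(EventTimeSessionWindows.withGap(Time.seconds(5)))
                .reduce((a, b) -> a) // placeholder for the real per-batch computation
                .print();

        env.execute("session-window-per-x");
    }
}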
