Flink - processing consecutive events within time constraint - apache-flink

I have a use case and I think I need some help on how to approach it.
Because I am new to streaming and Flink, I will try to be very descriptive in what I am trying to achieve. Sorry if I am not using formal and correct language.
My code will be in Java, but I am happy to get code in Python, pseudocode, or just a general approach.
TL;DR
Group events of the same key that are within some time limit.
Out of those events, create a result event only from the two closest (in the time domain) events.
This requires (I think) opening a window for each and every event that comes in.
If you look ahead at the batch solution, you will best understand my problem.
Background:
I have data coming from sensors as a stream from Kafka.
I need to use event time because that data arrives unordered. A lateness of about 1 minute covers roughly 90% of the events.
I am grouping those events by some key.
What I want to do:
Depending on some of the events' fields, I would like to "join/mix" 2 events into a new event ("result event").
The first condition is that those consecutive events are WITHIN 30 seconds of each other.
The next conditions simply check some field values and then decide.
My pseudo-solution:
Open a new window for EACH event. That window should be 1 minute long.
For every event that comes within that minute, I want to check its event time and see if it is within 30 seconds of the initial window event. If yes, check the other conditions and emit a new result event.
The Problem - when a new event arrives, it needs to:
create a new window for itself.
join only ONE window out of SEVERAL possible windows that are within 30 seconds of it.
The question:
Is that possible?
In other words, my connection is between two "consecutive" events only.
Thank you very much.
Maybe showing the solution for the **BATCH case** will best show what I am trying to do:
for i in range(len(grouped_events) - 1):
    event_A = grouped_events[i]
    event_B = grouped_events[i + 1]
    if event_B.get("time") - event_A.get("time") < 30:
        if event_B.get("color") == event_A.get("color"):
            if event_B.get("size") > event_A.get("size"):
                create_result_event(event_A, event_B)
My (naive) attempts so far with Flink in Java
**The sum function is just a placeholder for my function that creates a new result object.**
The first solution just does a simple time window and sums by some field.
The second tries to apply a process function on the window, and maybe there iterate through all events and check for my conditions?
DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .sum("size")
    .print();

DataStream
    .keyBy(threeEvent -> threeEvent.getUserId())
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .process(new processFunction());
public static class processFunction extends ProcessWindowFunction<ThreeEvent, Tuple3<Long, Long, Float>, Long, TimeWindow> {
    @Override
    public void process(Long key, Context context, Iterable<ThreeEvent> threeEvents, Collector<Tuple3<Long, Long, Float>> out) throws Exception {
        Float sumOfSize = 0F;
        for (ThreeEvent f : threeEvents) {
            sumOfSize += f.getSize();
        }
        out.collect(new Tuple3<>(context.window().getEnd(), key, sumOfSize));
    }
}

You can, of course, use windows to create mini-batches that you sort and analyze, but it will be difficult to handle the window boundaries correctly (what if the events that should be paired land in different windows?).
This looks like it would be much more easily done with a keyed stream and a stateful flatmap. Just use a RichFlatMapFunction and use one piece of keyed state (a ValueState) that remembers the previous event for each key. Then as each event is processed, compare it to the saved event, produce a result if that should happen, and update the state.
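A minimal sketch of that approach, assuming the question's ThreeEvent type (with getUserId, getColor, getSize, and getTime in milliseconds) and a hypothetical ResultEvent output type, might look like this:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class PairConsecutiveEvents extends RichFlatMapFunction<ThreeEvent, ResultEvent> {

    // keyed state: the previous event seen for the current key
    private transient ValueState<ThreeEvent> previousEvent;

    @Override
    public void open(Configuration parameters) {
        previousEvent = getRuntimeContext().getState(
                new ValueStateDescriptor<>("previousEvent", ThreeEvent.class));
    }

    @Override
    public void flatMap(ThreeEvent current, Collector<ResultEvent> out) throws Exception {
        ThreeEvent previous = previousEvent.value();
        if (previous != null
                && current.getTime() - previous.getTime() < 30_000    // within 30 seconds
                && current.getColor().equals(previous.getColor())
                && current.getSize() > previous.getSize()) {
            out.collect(new ResultEvent(previous, current));
        }
        previousEvent.update(current);
    }
}

You would apply it as stream.keyBy(ThreeEvent::getUserId).flatMap(new PairConsecutiveEvents()). Note that this assumes the events for each key are processed in timestamp order, which is exactly the concern discussed below.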
You can read about working with Flink's keyed state in the Flink training and in the Flink documentation.
The one thing that concerns me about your use case is whether or not your events may arrive out-of-order. Is it the case that to get correct results you would need to first sort the events by timestamp? That isn't trivial. If this is a concern, then I would suggest that you use Flink SQL with MATCH_RECOGNIZE, or the CEP library, both of which are designed for doing pattern recognition on event streams, and will take care of sorting the stream for you (you just have to provide timestamps and watermarks).
This query may not be exactly right, but hopefully conveys the flavor of how to do something like this with match recognize:
SELECT * FROM Events
MATCH_RECOGNIZE (
    PARTITION BY userId
    ORDER BY eventTime
    MEASURES
        A.userId AS userId,
        A.color AS color,
        A.size AS aSize,
        B.size AS bSize
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (A B)
    DEFINE
        A AS true,
        B AS timestampDiff(SECOND, A.eventTime, B.eventTime) < 30
            AND A.color = B.color
            AND A.size < B.size
);
This can also be done quite naturally with CEP, where the basis for comparing consecutive events is to use an iterative condition, and you can use a within clause to handle the time constraint.
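For example, an untested sketch of the CEP variant, again assuming the question's ThreeEvent type and its getters, might be:

Pattern<ThreeEvent, ?> pattern = Pattern.<ThreeEvent>begin("A")
    .next("B")
    .where(new IterativeCondition<ThreeEvent>() {
        @Override
        public boolean filter(ThreeEvent b, Context<ThreeEvent> ctx) throws Exception {
            // compare the candidate B against the event already matched as A
            ThreeEvent a = ctx.getEventsForPattern("A").iterator().next();
            return b.getColor().equals(a.getColor()) && b.getSize() > a.getSize();
        }
    })
    .within(Time.seconds(30));    // the time constraint between A and B

The pattern would then be applied with CEP.pattern(stream.keyBy(ThreeEvent::getUserId), pattern), and the result event built in the select/process function.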

Related

How do I find the event time difference between consecutive events in Flink?

I want to find the event time difference between every two consecutive input events. If the time difference is above a certain threshold then I want to output an event signalling the threshold has been breached. I also want the first event of the stream to always output this breach signal as an indication that it does not have a previous event to calculate a time difference with.
I tried using Flink's CEP library as it ensures that the events are ordered by event time.
The pattern I created is as follows:
Pattern.begin("begin").optional().next("end");
I use the optional() clause to cater for the first event as I figured the first event would be the only event where "begin" would not have a value.
When I input my events a1 a2 a3 a4 a5 I get the following output matches:
{a1} {a1 a2} {a2} {a2 a3} {a3} {a3 a4} {a4} {a4 a5}...
However I want the following as it will allow me to calculate the time difference between each consecutive event.
{a1} {a1 a2} {a2 a3} {a3 a4} {a4 a5}...
I have tried playing around with different AfterMatchSkipStrategy settings as well as IterativeCondition clauses but with no success.
Marking "begin" as optional is what's causing the unwanted matches. I would look for some other way to generate the breach signal for the first event -- e.g., perhaps you could prepend a dummy first event.
Another approach would be to only use CEP or SQL for sorting the stream, and then use a RichFlatMap or stateful process function to implement the business logic: i.e., compute the differences and generate the breach signals.
See Can I use Flink CEP to sort a stream? for how to do this.
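A rough sketch of the business-logic half of that second approach (after the stream has been sorted), assuming an Event type with getTimestamp() in milliseconds, a hypothetical Breach output type, and an assumed threshold constant:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class BreachDetector extends RichFlatMapFunction<Event, Breach> {

    private static final long THRESHOLD_MS = 30_000L;   // assumed threshold

    // keyed state: timestamp of the previous event for this key
    private transient ValueState<Long> lastTimestamp;

    @Override
    public void open(Configuration parameters) {
        lastTimestamp = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastTimestamp", Long.class));
    }

    @Override
    public void flatMap(Event event, Collector<Breach> out) throws Exception {
        Long previous = lastTimestamp.value();
        // first event for this key, or gap above the threshold: emit a breach signal
        if (previous == null || event.getTimestamp() - previous > THRESHOLD_MS) {
            out.collect(new Breach(event));
        }
        lastTimestamp.update(event.getTimestamp());
    }
}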

How to stop Apache Flink CEP Pattern?

Please help me, I have two questions:
I read JSON messages from Apache Kafka (then I have these steps: deserialization to POJO, filter, keyBy ....)
Which is better to use: KeyedProcessFunction (with state, timers, if-else logic blocks) or the Flink CEP pattern library?
I can check the input sequence in a KeyedProcessFunction (check state, if-else blocks, out.collect(...), state.clear()... you will understand me), and I can also use the Flink CEP library with conditions and quantifiers.
How to stop a Flink CEP pattern?
For example:
I have the input sequence: A1, (no events for 1 min) A2, (no events for 5 min) A3, (no events for 1 min) A4, (no events for more than 5 minutes) A5. (Between A1 and A5 there may be a lot of other events.)
I want to send to the output: A1, A3, A5.
The first event; then, if the next event arrives less than 5 minutes after the previous event, it is not sent to the output, and if it arrives more than 5 minutes after the previous event, it is sent to the output.
What should I add to my pattern?
Pattern<Event, ?> pattern = Pattern.<Event>begin("start")
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event event) {
            return event.getName().contains("A");
        }
    })
    .within(Time.minutes(5));
While at first glance this particular example seems rather trivial to implement as a KeyedProcessFunction, there is definitely some complexity that arises if the messages can arrive out of order. Then you might be fooled into thinking there has been a substantial gap, when in fact there was not.
However, this particular example is a good match for session windows, if you want an easy, out-of-the-box, ready-made solution.
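As an illustration, here is a rough sketch of the session-window approach, assuming the question's Event type plus a getKey() and a getTimestamp() in milliseconds: sessions are separated by gaps of at least 5 minutes, and emitting the earliest event of each session yields A1, A3, A5.

stream
    .keyBy(Event::getKey)
    .window(EventTimeSessionWindows.withGap(Time.minutes(5)))
    .process(new ProcessWindowFunction<Event, Event, String, TimeWindow>() {
        @Override
        public void process(String key, Context ctx, Iterable<Event> events, Collector<Event> out) {
            // window elements are not guaranteed to be ordered, so find the earliest explicitly
            Event first = null;
            for (Event e : events) {
                if (first == null || e.getTimestamp() < first.getTimestamp()) {
                    first = e;
                }
            }
            out.collect(first);
        }
    });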
With CEP, I think a working solution would have this flavor: you are looking for a sequence of an A (call it A1) followed immediately by another A (call it A2), where (A2.timestamp - A1.timestamp) >= 5 minutes. When a match is found, emit A1 and advance the matching engine so that A2 becomes the new A1. (Conveniently, CEP pre-sorts the input stream(s), so you don't have to worry about things being out-of-order.)
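An untested sketch of that CEP flavor, using an IterativeCondition for the timestamp comparison and an AFTER MATCH skip strategy so that the event matched as A2 can start the next match (Event#getTimestamp() in milliseconds is an assumption):

AfterMatchSkipStrategy skip = AfterMatchSkipStrategy.skipToLast("A2");

Pattern<Event, ?> pattern = Pattern.<Event>begin("A1", skip)
    .where(new SimpleCondition<Event>() {
        @Override
        public boolean filter(Event event) {
            return event.getName().contains("A");
        }
    })
    .next("A2")
    .where(new IterativeCondition<Event>() {
        @Override
        public boolean filter(Event a2, Context<Event> ctx) throws Exception {
            Event a1 = ctx.getEventsForPattern("A1").iterator().next();
            return a2.getName().contains("A")
                    && a2.getTimestamp() - a1.getTimestamp() >= Time.minutes(5).toMilliseconds();
        }
    });

// in the select/process function, emit the event matched as "A1" from each match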

How to debug multiple trigger events on keyed window

My DataStream is derived from a custom SourceFunction, which emits string sequences of WINDOW size, in a deterministic sequence.
The aim is to create sliding windows over the keyed stream for processing of the accumulated strings, based on event time.
To assign event times and watermarks, I attach an AssignerWithPeriodicWatermarks to the stream.
The sliding window is processed with a custom ProcessWindowFunction.
env.setStreamTimeCharacteristic(EventTime)

val seqStream = env.addSource(Seqstream)
    .assignTimestampsAndWatermarks(SeqTimeStampExtractor())
    .keyBy(getEventtimeKey)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(windowSize), Time.milliseconds(slideSize)))

val result = seqStream.process(ProcessSeqWindow(target1))
My AssignerWithPeriodicWatermarks looks like this:
class FASTATimeStampExtractor : AssignerWithPeriodicWatermarks<FASTAstring> {
    var waterMark = 9999L

    override fun extractTimestamp(element: FASTAstring, previousElementTimestamp: Long): Long {
        return element.f1
    }

    override fun getCurrentWatermark(): Watermark? {
        waterMark += 1
        return Watermark(waterMark)
    }
}
In other words, each element emitted by the source should have its own event time, and the watermark should be emitted allowing no further events for that time.
Stepping through the stream in a debugger indicates that event times / watermarks are generated as would be expected.
My expectation is that ProcessSeqWindow.run() ought to be called with a number of elements proportional to the time window (e.g. 10 ms), over event time. However, what I observe is that run() is called multiple times with single elements, and in an arbitrary sequence with respect to event time.
The behaviour persists when I force parallelism to 1.
My question is whether this is likely to be caused by multiple trigger events on each window, or are there other possible explanations? How can I debug the cause?
Thanks
The role of the watermarks in your job will be to trigger the closing of the sliding event time windows. In order to play that role properly, they should be based on the timestamps in the events, rather than some arbitrary constant (9999L). The reason why the same object is responsible for extracting timestamps and supplying watermarks is so that this object can base the watermarks it creates on its observations of the timestamps in the event stream. So unless your event timestamps are also based on incrementing a similar counter, this may explain some of the behavior you are seeing.
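A sketch of what such an assigner might look like (in Java rather than Kotlin; FASTAstring and its f1 timestamp field are taken from the question, and the zero out-of-orderness bound is an assumption):

public class SeqTimestampExtractor implements AssignerWithPeriodicWatermarks<FASTAstring> {

    private static final long MAX_OUT_OF_ORDERNESS_MS = 0L;   // assumes in-order events

    private long maxTimestampSeen = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(FASTAstring element, long previousElementTimestamp) {
        long ts = element.f1;
        maxTimestampSeen = Math.max(maxTimestampSeen, ts);
        return ts;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // the watermark tracks the largest timestamp observed so far
        return new Watermark(maxTimestampSeen - MAX_OUT_OF_ORDERNESS_MS);
    }
}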
Another issue is that while extractTimestamp is called for every event, in a periodic watermark assigner the getCurrentWatermark method is called in a separate thread once every 200 msec (by default). If you want watermarks after every event you'll need to use an AssignerWithPunctuatedWatermarks, though doing so is something of an anti-pattern (because having that many watermarks adds overhead).

If your timestamps are completely artificial, you might find a SlidingCountWindow a more natural fit for what you're doing.

Synchronize Apache Flink streams based on a time stamp

I have several use cases where I need to synchronize multiple streams based on a time stamp.
Here is an example where I want to sync trade bars and quote bars, which I generate from raw trades and quotes by aggregating, for example like this:
val tradeBars: DataStream[TradeBar] = trades
    .assignAscendingTimestamps(_.epochMillis)
    .keyBy("key")
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .process(new TimeTradeBar(new DownTick()))

val quotesWithFlow = quotes
    .assignAscendingTimestamps(_.epochMillis)
    .keyBy("key")
    .countWindow(2, 1)
    .reduce((previousQuote, quote) => Quote.localOrderFlow(previousQuote, quote))
    .assignAscendingTimestamps(_.epochMillis)
    .keyBy("key")

val quoteBars: DataStream[QuoteBar] = quotesWithFlow
    .assignAscendingTimestamps(_.epochMillis)
    .keyBy("key")
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .process(new QuoteBars.TimeQuoteBar())

val joined: JoinedStreams[TradeBar, QuoteBar]#Where[LocalDateTime]#EqualTo = tradeBars
    .join(quoteBars)
    .where(_.start).equalTo(_.start)
    // need a window here, just want to sync on same time window
I tried to use Flink's window join, but apparently it expects a window and then an apply method. All I want is to sync the streams on the same time window. I suspect that was not the intention of the join method.
I have a working implementation which uses Flink's stream connect method. I applied it to the trade bars stream and the raw quote stream, but that requires me to code a pretty messy CoProcessFunction myself:

class CoProcessTradeBarsAndQuotes() extends CoProcessFunction[TradeBar, Quote, (TradeBar, QuoteBar)] {
}

This is pretty messy because I have to keep track of quotes in a buffer and carefully perform the aggregation across the processElement1 and processElement2 functions. I guess there must be a simpler way; I just don't see it. Grateful for any help and ideas.
You didn't mention the logic you'd use to decide which two stocks (of likely many) to join, but in general I'd solve this by generating an output record from the first window function (open, high, low, close, stock) with an additional field representing the time (truncated to the hour) of the window, then key by that time field and do another windowing operation to create the join of the stocks that you need.

How to implement groupByUntil in apache-flink?

Some reactive frameworks have a groupByUntil function. It allows grouping elements by key and removing the group after a specific event or time interval (e.g., here is the description from RxJS).
As far as I can see, Apache Flink doesn't have such a function out of the box. Can anybody explain to me how to implement such a function in Apache Flink?
Did you have a look at Flink's time windows? Windows are used to group elements of a stream, for example by time and key.
You can define a tumbling time window as follows:
val s: DataStream[(Int, Long)] = ...
val r: DataStream[(Int, Long)] = s
    .keyBy(_._1)
    .timeWindow(Time.minutes(5))
    .minBy(1)
This will partition the stream by the first Int field (_._1) and, for each key, create a window every five minutes to group the elements. On each window, the minBy function is applied to select the element with the smallest Long value.
You can also define sliding windows, count windows, or implement your own windowing logic using Triggers and Evictors. The window evaluation function (minBy in the example) can also be a custom implementation.
You should check the DataStream documentation for more details.
