Flink: SessionWindowTimeGapExtractor - Compute the gap dynamically using data density

I have messages coming from Kafka into Flink, and I would like to create an EventTimeSessionWindows.withDynamicGap() that adapts over time based on the density of the data. To do this I have to create an enriched message that holds my "Event" + "the gap", which I have to calculate dynamically.
The enriched message will then be: Tuple2<Event, Long> where
Event: is a POJO that contains a CSV record from Kafka [tom, 53, 1.70, 18282822, ...] and
Long: is the gap parameter in milliseconds [129293838]
Currently this part of my code is:
DataStream<Tuple2<Event, Long>> enriched = stream
    .keyBy((Event ride) -> ride.CorrID)
    .map(new StatefulSessionCalculator());
where StatefulSessionCalculator() enriches the message, creating the Tuple2 described above.
After this I have to take the calculated gap out using something like this:
DataStream<Tuple2<Event, Long>> result = enriched
    .keyBy((...) -> ride.CorrID)
    .window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
My DynamicSessionWindows() should do the job of feeding the gap (the Long) back to Flink, but I don't understand how. It would just be a class that implements SessionWindowTimeGapExtractor<Tuple2<Event, Long>> and returns the gap from its extract() method.
I have the theory but I would need an example of how to do it.
If anyone can help me with this by putting down some code, it would be really appreciated.
Thanks

Here we go, I found out how to do it. It was a simple question, but being new to Java and Flink made me struggle a bit. I have also created a KeySelector:
WindowedStream<Tuple2<Event, Long>, String, TimeWindow> result = enriched
    .keyBy(new MyKeySelector())
    .window(EventTimeSessionWindows.withDynamicGap(new DynamicSessionWindows()));
And my DynamicSessionWindows() is this one:
public class DynamicSessionWindows implements SessionWindowTimeGapExtractor<Tuple2<Event, Long>> {

    @Override
    public long extract(Tuple2<Event, Long> value) {
        return value.f1;
    }
}
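For completeness, here is a minimal sketch of what MyKeySelector can look like (illustrative: it assumes CorrID is a String field on Event, matching the keyBy used earlier):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

public class MyKeySelector implements KeySelector<Tuple2<Event, Long>, String> {

    @Override
    public String getKey(Tuple2<Event, Long> value) {
        // key the enriched stream by the same correlation id used before enrichment
        return value.f0.CorrID;
    }
}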

Related

How to handle FLINK window on stream data's timestamp base?

I have a question.
Based on the timestamp in the class, I would like to implement logic that excludes data that arrives N or more times within 1 minute.
The UserData class has a timestamp variable:
class UserData {
    public Timestamp timestamp;
    public String userId;
}
At first I tried to use a tumbling window.
SingleOutputStreamOperator<UserData> validStream = stream
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingProcessingTimeWindows.of(Time.seconds(60)))
    .process(new ValidProcessWindow());
public class ValidProcessWindow extends ProcessWindowFunction<UserData, UserData, String, TimeWindow> {

    private int validCount = 10;

    @Override
    public void process(String key, Context context, Iterable<UserData> elements, Collector<UserData> out) throws Exception {
        int count = -1;
        for (UserData element : elements) {
            count++; // starts at 0
            if (count >= validCount) { // skip clicks beyond the valid count
                continue;
            }
            out.collect(element);
        }
    }
}
However, the tumbling window's time calculation is based on a fixed system time, so it is not suitable: it ignores the timestamp in the UserData class.
How can I window the stream based on the UserData class's timestamp?
Thanks.
Additional Information
I use code like this.
stream.assignTimestampsAndWatermarks(WatermarkStrategy.<UserData>forBoundedOutOfOrderness(Duration.ofSeconds(1))
        .withTimestampAssigner((event, timestamp) -> Timestamps.toMillis(event.timestamp)))
    .keyBy((KeySelector<UserData, String>) value -> value.userId)
    .window(TumblingEventTimeWindows.of(Time.seconds(60)))
    .process(new ValidProcessWindow());
I ran a test with 150 sample records; the timestamp of each record increased by 1 second.
The result was |1,2,3....59| |60,61....119|.
I waited for the last 30 records, but they were never processed.
I expected |1,2,3....59| |60,61....119| |120...149|.
How can I get the remaining records?
Self Answer
I found the cause.
I used only 150 sample records, and with event time Flink cannot make progress if there are no further elements to be processed:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/event_time.html#idling-sources
So I tested with the 150 sample records plus dummy records (the timestamp of each dummy record also increased by 1 second).
I then received the correct data |1,2,3....59| |60,61....119| |120...149|.
Thank you for the help.
As far as I understand your problem, you should just use a different time characteristic. Processing time uses the system time to calculate windows; you should use event time for your application. You can find more info about the proper usage of event time here.
EDIT:
That's how Flink works: there is no data to push the watermark past 150, so the window is not closed and thus there is no output. You can use a custom trigger that closes the window even if the watermark has not advanced, or inject some data to move the watermark.
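As an illustration of the custom-trigger option, here is a minimal sketch (untested; the class name and timeout are arbitrary). It behaves like the built-in event-time trigger but also registers a processing-time fallback, so a window can still fire when the watermark stalls at the end of a finite input:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

public class EventTimeTriggerWithTimeout extends Trigger<Object, TimeWindow> {

    private final long timeoutMs;

    public EventTimeTriggerWithTimeout(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp, TimeWindow window, TriggerContext ctx) {
        // normal event-time firing at the end of the window...
        ctx.registerEventTimeTimer(window.maxTimestamp());
        // ...plus a processing-time fallback in case the watermark stops advancing
        ctx.registerProcessingTimeTimer(ctx.getCurrentProcessingTime() + timeoutMs);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
        return time == window.maxTimestamp() ? TriggerResult.FIRE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
        // the timeout elapsed before the watermark reached the end of the window
        return TriggerResult.FIRE;
    }

    @Override
    public void clear(TimeWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(window.maxTimestamp());
    }
}

You would attach it with .window(TumblingEventTimeWindows.of(Time.seconds(60))).trigger(new EventTimeTriggerWithTimeout(30_000)).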

how can I implement keyed window timeouts in Flink?

I have keyed events coming in on a stream that I would like to accumulate by key, up to a timeout (say, 5 minutes), and then process the events accumulated up to that point (and ignore everything after for that key, but first things first).
I am new to Flink, but conceptually I think I need something like the code below.
DataStream<Tuple2<String, String>> dataStream = see
    .socketTextStream("localhost", 9999)
    .flatMap(new Splitter())
    .keyBy(0)
    .window(GlobalWindows.create())
    .trigger(ProcessingTimeTrigger.create()) // how do I set the timeout value?
    .fold(new Tuple2<>("", ""), new FoldFunction<Tuple2<String, String>, Tuple2<String, String>>() {
        public Tuple2<String, String> fold(Tuple2<String, String> agg, Tuple2<String, String> elem) {
            if (agg.f0.isEmpty()) {
                agg.f0 = elem.f0;
            }
            if (agg.f1.isEmpty()) {
                agg.f1 = elem.f1;
            } else {
                agg.f1 = agg.f1 + "; " + elem.f1;
            }
            return agg;
        }
    });
This code does NOT compile because a ProcessingTimeTrigger needs a TimeWindow, and GlobalWindow is not a TimeWindow. So...
How can I accomplish keyed window timeouts in Flink?
You will have a much easier time if you approach this with a KeyedProcessFunction.
I suggest establishing an item of keyed ListState in the open() method of a KeyedProcessFunction. In the processElement() method, if the list is empty, set a processing-time timer (a per-key timer, relative to the current time) to fire when you want the window to end. Then append the incoming event to the list.
When the timer fires the onTimer() method will be called, and you can iterate over the list, produce a result, and clear the list.
To arrange for doing all of this only once per key, add a ValueState<Boolean> to the KeyedProcessFunction to keep track of this. (Note that if your key space is unbounded, you should think about a strategy for eventually expiring the state for stale keys.)
The documentation describes how to use Process Functions and how to work with state. You can find additional examples on the Flink training site, such as this exercise.
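Here is a minimal sketch of that approach (untested; the class and state names are illustrative). It assumes the stream is keyed with a KeySelector such as keyBy(t -> t.f0), so the key type is String, and the element type matches the Tuple2<String, String> stream above:

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class AccumulateUntilTimeout
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private static final long TIMEOUT_MS = 5 * 60 * 1000; // 5 minutes

    private transient ListState<Tuple2<String, String>> buffer;
    private transient ValueState<Boolean> done; // true once this key has fired

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(new ListStateDescriptor<>(
                "buffer", TypeInformation.of(new TypeHint<Tuple2<String, String>>() {})));
        done = getRuntimeContext().getState(new ValueStateDescriptor<>("done", Boolean.class));
    }

    @Override
    public void processElement(Tuple2<String, String> value, Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        if (Boolean.TRUE.equals(done.value())) {
            return; // ignore everything after the timeout for this key
        }
        if (!buffer.get().iterator().hasNext()) {
            // first element for this key: start the 5-minute timeout
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TIMEOUT_MS);
        }
        buffer.add(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, String>> out) throws Exception {
        // concatenate the buffered values, like the fold in the question did
        StringBuilder agg = new StringBuilder();
        for (Tuple2<String, String> e : buffer.get()) {
            if (agg.length() > 0) {
                agg.append("; ");
            }
            agg.append(e.f1);
        }
        out.collect(Tuple2.of(ctx.getCurrentKey(), agg.toString()));
        buffer.clear();
        done.update(true);
    }
}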

Proper way to assign watermark with DateStreamSource<List<T>> using Flink

I have continuous JSON array data produced to a Kafka topic, and I want to process the records with the event-time characteristic. In order to reach this goal, I have to assign timestamps and watermarks to each record contained in the JSON array.
I didn't find a convenient way to achieve this. My solution is to consume data from a DataStreamSource<List<MockData>>, then iterate the List and collect each object to a downstream with an anonymous ProcessFunction, and finally assign watermarks to this downstream.
The major code is shown below:
DataStreamSource<List<MockData>> listDataStreamSource = KafkaSource.genStream(env);
SingleOutputStreamOperator<MockData> convertToPojo = listDataStreamSource
    .process(new ProcessFunction<List<MockData>, MockData>() {
        @Override
        public void processElement(List<MockData> value, Context ctx, Collector<MockData> out)
                throws Exception {
            value.forEach(mockData -> out.collect(mockData));
        }
    });
convertToPojo.assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
        @Override
        public long extractTimestamp(MockData element) {
            return element.getTimestamp();
        }
    });
SingleOutputStreamOperator<Tuple2<String, Long>> countStream = convertToPojo
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction()).name("count elements");
The code seems all right and runs without error, but the ProcessWindowFunction is never triggered. I tracked through the Flink source code and found that EventTimeTrigger never returns TriggerResult.FIRE, because TriggerContext.getCurrentWatermark returns Long.MIN_VALUE all the time.
What's the proper way to process a List in event time? Any suggestion will be appreciated.
The problem is that you are applying the keyBy and window operations to the convertToPojo stream, rather than the stream with timestamps and watermarks (which you didn't assign to a variable).
If you write the code more or less like this, it should work:
listDataStreamSource = KafkaSource ...
convertToPojo = listDataStreamSource.process ...
pojoPlusWatermarks = convertToPojo.assignTimestampsAndWatermarks ...
countStream = pojoPlusWatermarks.keyBy ...
Calling assignTimestampsAndWatermarks on the convertToPojo stream does not modify that stream, but rather creates a new datastream object that includes timestamps and watermarks. You need to apply your windowing to that new datastream.
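Applied to the code from the question, the wiring looks roughly like this (same classes as above; only the stream variables change):

SingleOutputStreamOperator<MockData> pojoPlusWatermarks = convertToPojo
    .assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<MockData>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(MockData element) {
                return element.getTimestamp();
            }
        });

SingleOutputStreamOperator<Tuple2<String, Long>> countStream = pojoPlusWatermarks
    .keyBy("country")
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(10)))
    .process(new FlinkEventTimeCountFunction()).name("count elements");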

Is there an equivalent to Kafka's KTable in Apache Flink?

Apache Kafka has the concept of a KTable, where "each data record represents an update".
Essentially, I can consume a Kafka topic and only keep the latest message per key.
Is there a similar concept available in Apache Flink? I have read about Flink's Table API, but it does not seem to solve the same problem.
Some help comparing and contrasting the two frameworks would be helpful. I am not looking for which is better or worse, but rather just how they differ; which is right would then depend on my requirements.
You are right. Flink's Table API and its Table class do not correspond to Kafka's KTable. The Table API is a relational language-embedded API (think of SQL integrated in Java and Scala).
Flink's DataStream API does not have a built-in concept that corresponds to a KTable. Instead, Flink offers sophisticated state management and a KTable would be a regular operator with keyed state.
For example, a stateful operator with two inputs that stores the latest value observed from the first input and joins it with values from the second input can be implemented with a CoFlatMapFunction as follows:
DataStream<Tuple2<Long, String>> first = ...
DataStream<Tuple2<Long, String>> second = ...

DataStream<Tuple2<String, String>> result = first
    // connect first and second stream
    .connect(second)
    // key both streams on the first (Long) attribute
    .keyBy(0, 0)
    // join them
    .flatMap(new TableLookup());
// ------
public static class TableLookup
        extends RichCoFlatMapFunction<Tuple2<Long, String>, Tuple2<Long, String>, Tuple2<String, String>> {

    // keyed state
    private ValueState<String> lastVal;

    @Override
    public void open(Configuration conf) {
        ValueStateDescriptor<String> valueDesc =
            new ValueStateDescriptor<String>("table", Types.STRING);
        lastVal = getRuntimeContext().getState(valueDesc);
    }

    @Override
    public void flatMap1(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // update the value for the current Long key with the String value
        lastVal.update(value.f1);
    }

    @Override
    public void flatMap2(Tuple2<Long, String> value, Collector<Tuple2<String, String>> out) throws Exception {
        // look up the latest String for the current Long key
        String lookup = lastVal.value();
        // emit the current String and the looked-up String
        out.collect(Tuple2.of(value.f1, lookup));
    }
}
In general, state can be used very flexibly with Flink and lets you implement a wide range of use cases. There are also more state types, such as ListState and MapState, and with a ProcessFunction you have fine-grained control over time, for example to remove the state of a key if it has not been updated for a certain amount of time (KTables have a configuration for that, as far as I know).
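If you are on a version that supports it (Flink 1.6 or later), state time-to-live is a declarative alternative to timer-based cleanup. A minimal sketch applied to the descriptor above (the one-hour TTL is an arbitrary example):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

// expire entries one hour after the last write, and never return expired values
StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.hours(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();

ValueStateDescriptor<String> valueDesc = new ValueStateDescriptor<>("table", Types.STRING);
valueDesc.enableTimeToLive(ttlConfig);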

Flink timeWindow get start time

I'm calculating a count (summing 1) over a timewindow as follows:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .sum("count");
I would actually like to add the window start time as a key field too, so the result would be something like:
key: videoId=123, userId=234, time=2016-09-16T17:01:30
value: 50
So essentially, aggregate the count by window. The end goal is to draw a histogram of these windows.
How can I add the start of the window as a field in the key? And, following that, can the windows be aligned to :00 or :30 in this case? Is that possible?
The apply() method of the WindowFunction provides a Window object, which is a TimeWindow if you use keyBy().timeWindow(). The TimeWindow object has two methods, getStart() and getEnd() which return the timestamp of the window's start and end, respectively.
At the moment it is not possible to use the sum() aggregation together with a WindowFunction. You need to do something like:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .apply(new MySumReduceFunction(), new MyWindowFunction());
MySumReduceFunction implements the ReduceFunction interface and computes the sum by incrementally aggregating the elements that arrive in the window. MyWindowFunction implements WindowFunction. It receives the aggregated value through the Iterable parameter and enriches it with the timestamp obtained from the TimeWindow parameter.
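A minimal sketch of both functions (illustrative: it assumes a POJO UserTrackingEvent with videoId, userId, and count fields; adapt the types to your stream):

public static class MySumReduceFunction implements ReduceFunction<UserTrackingEvent> {

    @Override
    public UserTrackingEvent reduce(UserTrackingEvent a, UserTrackingEvent b) {
        // incrementally sum the counts as elements arrive in the window
        a.count += b.count;
        return a;
    }
}

public static class MyWindowFunction
        implements WindowFunction<UserTrackingEvent, Tuple4<String, String, Long, Integer>, Tuple, TimeWindow> {

    @Override
    public void apply(Tuple key, TimeWindow window, Iterable<UserTrackingEvent> input,
                      Collector<Tuple4<String, String, Long, Integer>> out) {
        // the reduce function left a single pre-aggregated element in the window
        UserTrackingEvent event = input.iterator().next();
        out.collect(Tuple4.of(event.videoId, event.userId, window.getStart(), event.count));
    }
}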
You can use the method aggregate instead of sum.
In aggregate, set the second parameter to a class that implements WindowFunction or extends ProcessWindowFunction.
I am using Flink 1.4.0 and recommend using ProcessWindowFunction, like:
mappedUserTrackingEvent
    .keyBy("videoId", "userId")
    .timeWindow(Time.seconds(30))
    .aggregate(new Count(), new MyProcessWindowFunction());

public static class MyProcessWindowFunction extends ProcessWindowFunction<Integer, Tuple2<Long, Integer>, Tuple, TimeWindow> {

    @Override
    public void process(Tuple tuple, Context context, Iterable<Integer> iterable, Collector<Tuple2<Long, Integer>> collector) throws Exception {
        // context.window().getStart() gives the window start time;
        // emit it together with the pre-aggregated count
        collector.collect(Tuple2.of(context.window().getStart(), iterable.iterator().next()));
    }
}
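For completeness, a hedged sketch of the Count aggregate referenced above (the input type UserTrackingEvent is again an assumed POJO): it counts the window's elements, and its Integer result becomes the single element of the Iterable<Integer> passed to MyProcessWindowFunction.

public static class Count implements AggregateFunction<UserTrackingEvent, Integer, Integer> {

    @Override
    public Integer createAccumulator() {
        return 0;
    }

    @Override
    public Integer add(UserTrackingEvent value, Integer accumulator) {
        // count elements; the value itself is not needed
        return accumulator + 1;
    }

    @Override
    public Integer getResult(Integer accumulator) {
        return accumulator;
    }

    @Override
    public Integer merge(Integer a, Integer b) {
        return a + b;
    }
}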
