How to avoid repeated tuples in a Flink sliding window join? - apache-flink

For example, there are two streams. One is advertisements shown to users; its tuples can be described as (advertiseId, showed timestamp). The other is the click stream: (advertiseId, clicked timestamp). We want to get a joined stream that includes every advertisement clicked by a user within 20 minutes of being shown. My solution is to join these two streams on a SlidingTimeWindow, but the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))

Solution 1:
Let Flink support joining two streams on separate windows, as Spark Streaming does. In this case, apply SlidingTimeWindows(21 mins, 1 min) to the advertisement stream and TumblingTimeWindows(1 min) to the click stream, then join these two windowed streams (the example code below uses seconds rather than minutes).
TumblingTimeWindows avoids duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoids missing legitimate clicks.
One issue is that some invalid clicks (clicks arriving more than 20 minutes after the ad was shown) would appear in the joined stream. This is easily fixed by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
        new MultiWindowsJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
        .window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
        .equalTo(keySelector)
        .window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
        .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            private static final long serialVersionUID = -3625150954096822268L;

            @Override
            public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                return new Tuple3<>(first.f0, first.f1, second.f1);
            }
        });
joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
    private static final long serialVersionUID = -4325256210808325338L;

    @Override
    public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
        // keep only clicks that happen within 20 seconds after the ad was shown
        return value.f1 < value.f2 && value.f1 + 20000 >= value.f2;
    }
});
Solution 2:
Flink supports join operations without windows. A join operator implementing the TwoInputStreamOperator interface keeps two time-length-based buffers of the two streams and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
        .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -6564495005753073342L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(" ");
                return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
            }
        }).keyBy(keySelector).assignTimestamps(timestampExtractor1);

DataStream<Tuple2<String, Long>> click = env
        .addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -6564495005753073342L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(" ");
                return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
            }
        }).keyBy(keySelector).assignTimestamps(timestampExtractor2);

NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
        new NoWindowJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
        .where(keySelector)
        .buffer(Time.of(20, TimeUnit.SECONDS))
        .equalTo(keySelector)
        .buffer(Time.of(5, TimeUnit.SECONDS))
        .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            private static final long serialVersionUID = -5075871109025215769L;

            @Override
            public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                return new Tuple3<>(first.f0, first.f1, second.f1);
            }
        });
I implemented two new join operators based on the Flink streaming API TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
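For readers on newer Flink versions: the buffering idea behind such an operator can also be sketched with the public KeyedCoProcessFunction API instead of a custom TwoInputStreamOperator. This is only a minimal illustration of the approach, not the implementation from the repository; the state name, the 20-minute retention, and the event-time-timer cleanup are assumptions:

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// Buffers ads in keyed state, matches each incoming click against the buffer,
// and cleans up ads once their 20-minute click window has passed.
public class BufferedAdClickJoin extends
        KeyedCoProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>> {

    private static final long CLICK_WINDOW_MS = 20 * 60 * 1000L;

    private transient ListState<Tuple2<String, Long>> adBuffer;

    @Override
    public void open(Configuration parameters) {
        adBuffer = getRuntimeContext().getListState(new ListStateDescriptor<>(
                "ads", TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {})));
    }

    @Override
    public void processElement1(Tuple2<String, Long> ad, Context ctx,
                                Collector<Tuple3<String, Long, Long>> out) throws Exception {
        adBuffer.add(ad);
        // schedule cleanup for when no matching click can arrive anymore
        ctx.timerService().registerEventTimeTimer(ad.f1 + CLICK_WINDOW_MS);
    }

    @Override
    public void processElement2(Tuple2<String, Long> click, Context ctx,
                                Collector<Tuple3<String, Long, Long>> out) throws Exception {
        for (Tuple2<String, Long> ad : adBuffer.get()) {
            if (ad.f1 <= click.f1 && click.f1 <= ad.f1 + CLICK_WINDOW_MS) {
                out.collect(new Tuple3<>(ad.f0, ad.f1, click.f1)); // each pair emitted once
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple3<String, Long, Long>> out) throws Exception {
        // keep only ads whose click window is still open
        List<Tuple2<String, Long>> keep = new ArrayList<>();
        for (Tuple2<String, Long> ad : adBuffer.get()) {
            if (ad.f1 + CLICK_WINDOW_MS > timestamp) {
                keep.add(ad);
            }
        }
        adBuffer.update(keep);
    }
}

It would be wired in as advertisement.connect(click).keyBy(a -> a.f0, c -> c.f0).process(new BufferedAdClickJoin()), producing the same Tuple3 output as the join above.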

In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by specifying only the window size (the default slide is equal to the window size).
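Applied to the code from the question, a non-overlapping variant would look like this (a sketch; note that a tumbling window can only pair an ad and a click that fall into the same window):

stream1.join(stream2)
    .where(0)
    .equalTo(0)
    // size only, no slide: tumbling windows, so no element takes part in more than one window
    .window(TumblingTimeWindows.of(Time.of(20, TimeUnit.MINUTES)))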

While searching for a solution to the same problem, I found the "Interval Join" very useful; it does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
With this, no explicit window has to be defined; instead, an interval is applied to each individual element, joining it with all elements of the other stream whose timestamps fall within the given bounds relative to its own timestamp (see the interval-join figure in the Flink documentation).
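Applied to the advertisement/click use case from the original question, a minimal sketch could look like this (assuming the keyed Tuple2<String, Long> streams advertisement and click from the earlier answers):

// Each (ad, click) pair is emitted exactly once; only clicks that occur
// between 0 and 20 minutes after the ad was shown are joined.
advertisement
    .keyBy(ad -> ad.f0)
    .intervalJoin(click.keyBy(c -> c.f0))
    .between(Time.minutes(0), Time.minutes(20))
    .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        @Override
        public void processElement(Tuple2<String, Long> ad, Tuple2<String, Long> clk,
                                   Context ctx, Collector<Tuple3<String, Long, Long>> out) {
            out.collect(new Tuple3<>(ad.f0, ad.f1, clk.f1));
        }
    });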

Related

Flink Watermarks on Event Time

I'm trying to understand watermarks with event time.
My code is similar to the WordCount example in the Flink documentation.
I made some changes to include a timestamp in the event and added watermarks.
The event format is: word;timestamp
The map function creates a Tuple3 of word;1;timestamp.
Then a watermark strategy is assigned, with a timestamp assigner that reads the event's timestamp field.
For the following stream events:
test;1662128808294
test;1662128818065
test;1662128822434
test;1662128826434
test;1662128831175
test;1662128836581
I got the following result: (test,6) => This is correct; I sent the word test 6 times.
But looking at the context in the ProcessWindowFunction, I see the following:
Processing Time: Fri Sep 02 15:27:20 WEST 2022
Watermark: Fri Sep 02 15:26:56 WEST 2022
Start Window: 2022 09 02 15:26:40 End Window: 2022 09 02 15:27:20
The window is correct: it's a 40-second window, as defined. The watermark is also correct: it's 20 seconds behind the last event timestamp (1662128836581 = Friday, September 2, 2022 3:27:16), as defined in the watermark strategy.
My question is about the window's processing time. The window fired exactly at the window's end in processing time, but shouldn't it wait until the watermark passes the end of the window (something like processing time = end of window + 20 seconds), per the window default trigger docs?
What am I doing wrong? Or do I have a bad understanding of watermarks?
My Code:
public class DataStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        WatermarkStrategy<Tuple3<String, Integer, Long>> strategy = WatermarkStrategy
                .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
                .withTimestampAssigner((event, timestamp) -> event.f2);

        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .map(new Splitter())
                .assignTimestampsAndWatermarks(strategy)
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
                .process(new MyProcessWindowFunction());

        dataStream.print();
        env.execute("Window WordCount");
    }

    public static class Splitter extends RichMapFunction<String, Tuple3<String, Integer, Long>> {
        @Override
        public Tuple3<String, Integer, Long> map(String value) throws Exception {
            String[] word = value.split(";");
            return new Tuple3<String, Integer, Long>(word[0], 1, Long.parseLong(word[1]));
        }
    }

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow> {
        @Override
        public void process(String s, ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow>.Context context, Iterable<Tuple3<String, Integer, Long>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer sum = 0;
            for (Tuple3<String, Integer, Long> in : elements) {
                sum++;
            }
            out.collect(new Tuple2<String, Integer>(s, sum));

            Date date = new Date(context.window().getStart());
            Date date2 = new Date(context.window().getEnd());
            Date watermark = new Date(context.currentWatermark());
            Date processingTime = new Date(context.currentProcessingTime());
            System.out.println(context.currentWatermark());
            System.out.println("Processing Time: " + processingTime);
            Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss");
            System.out.println("Watermark: " + watermark);
            System.out.println("Start Window: " + format.format(date) + " End Window: " + format.format(date2));
        }
    }
}
Thanks.
To get event time windows, you need to change
.window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
to
.window(TumblingEventTimeWindows.of(Time.seconds(40)))
Processing-time windows fire based on the wall clock and ignore the watermark entirely, which is why your window fired exactly at its processing-time end rather than waiting for the watermark.

Upgrading Flink deprecated function calls

I am currently trying to upgrade a call to assignTimestampsAndWatermarks that is applied to a data stream. The data stream looks something like this:
DataStream<Auction> auctions = env.addSource(new AuctionSourceFunction(auctionSrcRates))
        .name("Custom Source")
        .setParallelism(params.getInt("p-auction-source", 1))
        .assignTimestampsAndWatermarks(new AuctionTimestampAssigner());
The AssignerWithPeriodicWatermarks implementation looks like this:
private static final class AuctionTimestampAssigner implements AssignerWithPeriodicWatermarks<Auction> {
    private long maxTimestamp = Long.MIN_VALUE;

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(maxTimestamp);
    }

    @Override
    public long extractTimestamp(Auction element, long previousElementTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, element.dateTime);
        return element.dateTime;
    }
}
What are the steps I would need to take to upgrade from deprecated calls to the current best practices? Thanks.
Your watermark generator assumes that the events are in order by timestamp, or at least accepts that any out-of-order events will be late. This is equivalent to:
assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Auction>forMonotonousTimestamps()
                .withTimestampAssigner((event, timestamp) -> event.dateTime))
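If you would rather tolerate a bounded amount of out-of-orderness than treat out-of-order events as late, the usual replacement is forBoundedOutOfOrderness (the 5-second bound here is only an illustrative assumption):

assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Auction>forBoundedOutOfOrderness(Duration.ofSeconds(5)) // assumed bound
                .withTimestampAssigner((event, timestamp) -> event.dateTime))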

Find stream of events that are not grouped using coGroupFunction

How can we find the stream of events that are not matched with other events when using CoGroupFunction?
Let's consider people communicating over phone calls. In Tuple2<String, Integer>, f0 is the name of the person and f1 is the phone number they are calling OR receiving a call from.
We have paired them using coGroup, but we are missing the people who get calls from someone outside this set.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Tuple2<String, Integer>> callers = env.fromElements(
        new Tuple2<String, Integer>("alice->", 12),   // alice dials 12
        new Tuple2<String, Integer>("bob->", 13),     // bob dials 13
        new Tuple2<String, Integer>("charlie->", 19))
        .assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));

DataStream<Tuple2<String, Integer>> callees = env.fromElements(
        new Tuple2<String, Integer>("->carl", 12),    // carl receives a call
        new Tuple2<String, Integer>("->ted", 13),
        new Tuple2<String, Integer>("->chris", 7))
        .assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));

DataStream<Tuple1<String>> groupedStream = callers.coGroup(callees)
        .where(evt -> evt.f1).equalTo(evt -> evt.f1)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .apply(new IntEqualCoGroupFunc());

groupedStream.print(); // prints 1> (alice->-->carl) \n 1> (bob->-->ted)

//DataStream<Tuple1<String>> notGroupedStream = ..; // people without pairs in the last window
//notGroupedStream.print(); // should print charlie->-->someone \n someone->-->chris

env.execute();
To be honest, the simplest solution seems to be changing IntEqualCoGroupFunc so that instead of String it returns (Boolean, String).
This is because coGroup also processes elements that do not have matching keys; those elements arrive with one Iterable empty in coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out). In your case the function would receive an empty Iterable as first and ("->chris", 7) in second.
This change of signature would let you also emit results that do not have matching keys and simply split them into separate streams at a later stage of processing, as sketched after the example output below.
// Implementation of IntEqualCoGroupFunc
@Override
public void coGroup(Iterable<Tuple2<String, Integer>> outbound, Iterable<Tuple2<String, Integer>> inbound,
                    Collector<Tuple1<String>> out) throws Exception {
    for (Tuple2<String, Integer> outboundObj : outbound) {
        for (Tuple2<String, Integer> inboundObj : inbound) {
            out.collect(Tuple1.of(outboundObj.f0 + "-" + inboundObj.f0)); // matching pair
            return;
        }
        out.collect(Tuple1.of(outboundObj.f0 + "->someone")); // inbound is empty
        return;
    }
    // outbound is empty
    for (Tuple2<String, Integer> inboundObj : inbound) {
        out.collect(Tuple1.of("someone->-" + inboundObj.f0));
        return;
    }
    // inbound also empty
    out.collect(Tuple1.of("someone->-->someone"));
}
Output as follows:
2> (someone->-->chris)
2> (charlie->->someone)
1> (alice->-->carl)
1> (bob->-->ted)
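A minimal sketch of the suggested splitting step, assuming IntEqualCoGroupFunc has been rewritten to emit Tuple2<Boolean, String> with f0 marking whether the pair matched:

DataStream<Tuple2<Boolean, String>> all = callers.coGroup(callees)
        .where(evt -> evt.f1).equalTo(evt -> evt.f1)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .apply(new IntEqualCoGroupFunc()); // assumed to emit Tuple2<Boolean, String> now

// matched pairs, e.g. (true, "alice->-->carl")
DataStream<Tuple2<Boolean, String>> groupedStream = all.filter(t -> t.f0);
// unmatched callers/callees, e.g. (false, "charlie->->someone")
DataStream<Tuple2<Boolean, String>> notGroupedStream = all.filter(t -> !t.f0);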

Flink DataSet Tuple Values not coming as expected

I have a DataSet<Tuple3<String, String, Double>> values which has the following data:
<Vijaya,Chocolate,5>
<Vijaya,Chips,10>
<Rahul,Chocolate,2>
<Rahul,Chips,8>
I want the DataSet<Tuple5<String, String, Double, String, Double>> values1 to be as follows:
<Vijaya,Chocolate,5,Chips,10>
<Rahul,Chocolate,2,Chips,8>
My code looks like the following:
DataSet<Tuple5<String, String, Double, String, Double>> values1 = values.fullOuterJoin(values)
        .where(0)
        .equalTo(0)
        .with(
            new JoinFunction<Tuple3<String, String, Double>, Tuple3<String, String, Double>, Tuple5<String, String, Double, String, Double>>() {
                private static final long serialVersionUID = 1L;

                public Tuple5<String, String, Double, String, Double> join(Tuple3<String, String, Double> first, Tuple3<String, String, Double> second) {
                    return new Tuple5<String, String, Double, String, Double>(first.f0, first.f1, first.f2, second.f1, second.f2);
                }
            })
        .distinct(1, 3)
        .distinct(1);
In the above code I tried doing a self join. I want the output in that particular format, but I am unable to get it.
How can I do this?
Please help.
Since you don't want the output to have the same item repeated, you can use a flat join, in which you output only those records whose value in the 2nd position is not equal to the value in the 4th position. Also, if you want only "Chocolate" in the 2nd position, that can be checked inside the FlatJoinFunction as well. Please find below the link to Flink's documentation on joins with a flat-join function, followed by a sketch.
https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/batch/dataset_transformations.html#join-with-flat-join-function
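A minimal sketch of that flat join, under the assumptions above (the check on "Chocolate" pins it to the 2nd position and drops self-pairs; an inner self join suffices since both inputs are the same dataset):

DataSet<Tuple5<String, String, Double, String, Double>> values1 = values.join(values)
        .where(0)
        .equalTo(0)
        .with(new FlatJoinFunction<Tuple3<String, String, Double>, Tuple3<String, String, Double>,
                Tuple5<String, String, Double, String, Double>>() {
            @Override
            public void join(Tuple3<String, String, Double> first, Tuple3<String, String, Double> second,
                             Collector<Tuple5<String, String, Double, String, Double>> out) {
                // emit only one ordering per person: "Chocolate" first, the other item second
                if (first.f1.equals("Chocolate") && !second.f1.equals("Chocolate")) {
                    out.collect(new Tuple5<>(first.f0, first.f1, first.f2, second.f1, second.f2));
                }
            }
        });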
Approach using GroupReduceFunction:
values
    .groupBy(0)
    .reduceGroup(new GroupReduceFunction<Tuple3<String, String, Double>, Tuple2<String, String>>() {
        @Override
        public void reduce(Iterable<Tuple3<String, String, Double>> in, Collector<Tuple2<String, String>> out) {
            StringBuilder output = new StringBuilder();
            String name = null;
            for (Tuple3<String, String, Double> item : in) {
                name = item.f0;
                output.append(item.f1 + "," + item.f2 + ",");
            }
            out.collect(new Tuple2<String, String>(name, output.toString()));
        }
    });

The begin and end time of windows

What would be the way to show the begin and end time of windows? Something like implementing user-defined windows?
I would like to know when windows begin and are evaluated, such that the output is:
quantity(WindowAll Sum), window_start_time, window_end_time
12, 1:13:21, 1:13:41
6, 1:13:41, 1:15:01
Found the answer: TimeWindow has getStart() and getEnd().
Example usage:
public static class SumAllWindow implements AllWindowFunction<Tuple2<String, Integer>,
        Tuple3<Integer, String, String>, TimeWindow> {

    private static transient DateTimeFormatter timeFormatter =
            DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SS").withLocale(Locale.GERMAN)
                    .withZone(DateTimeZone.forID("Europe/Berlin"));

    @Override
    public void apply(TimeWindow window, Iterable<Tuple2<String, Integer>> values,
                      Collector<Tuple3<Integer, String, String>> out) throws Exception {
        DateTime startTs = new DateTime(window.getStart(), DateTimeZone.forID("Europe/Berlin"));
        DateTime endTs = new DateTime(window.getEnd(), DateTimeZone.forID("Europe/Berlin"));
        int sum = 0;
        for (Tuple2<String, Integer> value : values) {
            sum += value.f1;
        }
        out.collect(new Tuple3<>(sum, startTs.toString(timeFormatter), endTs.toString(timeFormatter)));
    }
}
In main():
msgStream.timeWindowAll(Time.of(6, TimeUnit.SECONDS)).apply(new SumAllWindow()).print();
