Flink Watermarks on Event Time

I'm trying to understand watermarks with Event Time.
My code is similar to the WordCount example from the Flink documentation.
I made some changes to include a timestamp in each event, and added watermarks.
Event format is: word;timestamp
The map function creates a Tuple3 of (word, 1, timestamp).
Then a watermark strategy is assigned, with a timestamp assigner that uses the event's timestamp field.
For the following stream events:
test;1662128808294
test;1662128818065
test;1662128822434
test;1662128826434
test;1662128831175
test;1662128836581
I got the following result: (test,6) => This is correct; I sent the word "test" 6 times.
But looking at the context in the ProcessWindowFunction I see the following:
Processing Time: Fri Sep 02 15:27:20 WEST 2022
Watermark: Fri Sep 02 15:26:56 WEST 2022
Start Window: 2022 09 02 15:26:40 End Window: 2022 09 02 15:27:20
The window is correct: it's a 40-second window, as defined. The watermark is also correct: it's 20 seconds less than the last event timestamp (1662128836581 = Friday, September 2, 2022 15:27:16), as defined in the watermark strategy.
My question is about the window's processing time. The window fired exactly when processing time reached the end of the window, but shouldn't it wait until the watermark passes the end of the window (something like processing time = end of window + 20 seconds), per the default window trigger docs?
What am I doing wrong? Or do I have a bad understanding of watermarks?
My Code:
public class DataStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        WatermarkStrategy<Tuple3<String, Integer, Long>> strategy = WatermarkStrategy
                .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
                .withTimestampAssigner((event, timestamp) -> event.f2);

        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .map(new Splitter())
                .assignTimestampsAndWatermarks(strategy)
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
                .process(new MyProcessWindowFunction());

        dataStream.print();
        env.execute("Window WordCount");
    }

    public static class Splitter extends RichMapFunction<String, Tuple3<String, Integer, Long>> {
        @Override
        public Tuple3<String, Integer, Long> map(String value) throws Exception {
            String[] word = value.split(";");
            return new Tuple3<String, Integer, Long>(word[0], 1, Long.parseLong(word[1]));
        }
    }

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow> {
        @Override
        public void process(String s, ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow>.Context context, Iterable<Tuple3<String, Integer, Long>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer sum = 0;
            for (Tuple3<String, Integer, Long> in : elements) {
                sum++;
            }
            out.collect(new Tuple2<String, Integer>(s, sum));

            Date date = new Date(context.window().getStart());
            Date date2 = new Date(context.window().getEnd());
            Date watermark = new Date(context.currentWatermark());
            Date processingTime = new Date(context.currentProcessingTime());
            System.out.println(context.currentWatermark());
            System.out.println("Processing Time: " + processingTime);
            Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss");
            System.out.println("Watermark: " + watermark);
            System.out.println("Start Window: " + format.format(date) + " End Window: " + format.format(date2));
        }
    }
}
Thanks.

To get event time windows, you need to change
.window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
to
.window(TumblingEventTimeWindows.of(Time.seconds(40)))
A processing-time window ignores watermarks entirely: its trigger fires from the wall clock, which is why your window closed exactly when processing time reached the end of the window. An event-time window only fires once the watermark passes the end of the window, which is the behavior you expected.
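For reference, this is the pipeline from the question with only the window assigner swapped; everything else stays the same:

DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9999)
        .map(new Splitter())
        .assignTimestampsAndWatermarks(strategy)
        .keyBy(value -> value.f0)
        .window(TumblingEventTimeWindows.of(Time.seconds(40))) // event time instead of processing time
        .process(new MyProcessWindowFunction());

With the 20-second bounded-out-of-orderness strategy, this window fires once an event with a timestamp >= window end + 20 seconds arrives (for a finite input, the final MAX_VALUE watermark at the end of the job fires any remaining windows).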

Related

How to add a custom WatermarkGenerator to a WatermarkStrategy

I'm using Apache Flink 1.11 and want to use a custom WatermarkGenerator.
With the WatermarkStrategy, you can add built-in WatermarkGenerators with ease:
WatermarkStrategy.forMonotonousTimestamps();
WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(10));
In the documentation, you can see how to implement a custom WatermarkGenerator, for example a periodic WatermarkGenerator:
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds
    private long currentMaxTimestamp;

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // emit the watermark as current highest timestamp minus the out-of-orderness bound
        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
    }
}
How can I add this custom BoundedOutOfOrdernessGenerator to a WatermarkStrategy?
A WatermarkStrategy is the thing you need to define. So assuming you have some class MyWatermarkGenerator that implements WatermarkGenerator<MyEvent>, then you'd do something like:
WatermarkStrategy<MyEvent> ws = (ctx -> new MyWatermarkGenerator());
...
DataStream<MyEvent> ds = xxx;
ds.assignTimestampsAndWatermarks(ws);
Note that unless your source is setting up timestamps for you (e.g. Kafka record timestamps), you'll want to add a timestamp extractor to your WatermarkStrategy, as in...
WatermarkStrategy<MyEvent> ws = (ctx -> new MyWatermarkGenerator());
ws = ws.withTimestampAssigner((r, ts) -> r.getTimestamp());
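Applied to the BoundedOutOfOrdernessGenerator from the question, the wiring could look like this (a sketch; it assumes MyEvent exposes a getTimestamp() accessor and that events is your input DataStream<MyEvent>):

WatermarkStrategy<MyEvent> strategy =
        ((WatermarkStrategy<MyEvent>) ctx -> new BoundedOutOfOrdernessGenerator())
                .withTimestampAssigner((event, ts) -> event.getTimestamp());

DataStream<MyEvent> withWatermarks = events.assignTimestampsAndWatermarks(strategy);

The cast is needed to give the lambda its target type before chaining withTimestampAssigner.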

Find stream of events that are not grouped using coGroupFunction

How can we find the stream of events that are not matched with other events when using a CoGroupFunction?
Let's consider people communicating over phone calls. In Tuple2<String, Integer>, f0 is the name of the person and f1 is the phone number they are calling OR receiving a call from.
We have paired them using coGroup; however, we are missing the people whose counterpart does not appear in the other stream.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream<Tuple2<String, Integer>> callers = env.fromElements(
        new Tuple2<String, Integer>("alice->", 12),   // alice dials 12
        new Tuple2<String, Integer>("bob->", 13),     // bob dials 13
        new Tuple2<String, Integer>("charlie->", 19))
    .assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));

DataStream<Tuple2<String, Integer>> callees = env.fromElements(
        new Tuple2<String, Integer>("->carl", 12),    // carl received a call
        new Tuple2<String, Integer>("->ted", 13),
        new Tuple2<String, Integer>("->chris", 7))
    .assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));

DataStream<Tuple1<String>> groupedStream = callers.coGroup(callees)
    .where(evt -> evt.f1).equalTo(evt -> evt.f1)
    .window(TumblingEventTimeWindows.of(Time.seconds(10)))
    .apply(new IntEqualCoGroupFunc());

groupedStream.print(); // prints 1> (alice->-->carl) \n 1> (bob->-->ted)

//DataStream<Tuple1<String>> notGroupedStream = ..; // people without pairs in last window
//notGroupedStream.print(); // should print charlie->-->someone \n someone->-->chris

env.execute();
To be honest, the simplest solution seems to be changing the IntEqualCoGroupFunc so that it returns (Boolean, String) instead of String.
This is because coGroup also processes those elements that do not have matching keys; such elements arrive with one Iterable empty in coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out). In your case it would receive an empty Iterable as first and ("->chris", 7) as second.
The change of signature lets you also emit the results that do not have matching keys, and simply split them into separate streams at a later stage of processing (a sketch of such a reworked function follows the sample output below).
// Implementation of IntEqualCoGroupFunc
@Override
public void coGroup(Iterable<Tuple2<String, Integer>> outbound, Iterable<Tuple2<String, Integer>> inbound,
                    Collector<Tuple1<String>> out) throws Exception {

    for (Tuple2<String, Integer> outboundObj : outbound) {
        for (Tuple2<String, Integer> inboundObj : inbound) {
            out.collect(Tuple1.of(outboundObj.f0 + "-" + inboundObj.f0)); // matching pair
            return;
        }
        out.collect(Tuple1.of(outboundObj.f0 + "->someone")); // inbound is empty
        return;
    }
    // outbound is empty
    for (Tuple2<String, Integer> inboundObj : inbound) {
        out.collect(Tuple1.of("someone->-" + inboundObj.f0));
        return;
    }
    // inbound also empty
    out.collect(Tuple1.of("someone->-->someone"));
}
Output as follows:
2> (someone->-->chris)
2> (charlie->->someone)
1> (alice->-->carl)
1> (bob->-->ted)
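A possible shape for the suggested signature change, with f0 tagging whether the pair matched (the class name and the Boolean tag convention here are illustrative, not from the original answer):

public static class TaggedCoGroupFunc
        implements CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Integer>, Tuple2<Boolean, String>> {

    @Override
    public void coGroup(Iterable<Tuple2<String, Integer>> outbound, Iterable<Tuple2<String, Integer>> inbound,
                        Collector<Tuple2<Boolean, String>> out) {
        for (Tuple2<String, Integer> o : outbound) {
            for (Tuple2<String, Integer> i : inbound) {
                out.collect(Tuple2.of(true, o.f0 + "-" + i.f0));  // matching pair
                return;
            }
            out.collect(Tuple2.of(false, o.f0 + "->someone"));    // inbound is empty
            return;
        }
        for (Tuple2<String, Integer> i : inbound) {               // outbound is empty
            out.collect(Tuple2.of(false, "someone->-" + i.f0));
            return;
        }
    }
}

Downstream, result.filter(t -> t.f0) and result.filter(t -> !t.f0) then give you the matched and unmatched streams.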

Does Apache Flink support multiple events with the same timestamp?

It seems like Apache Flink would not handle two events with the same timestamp well in certain scenarios.
According to the docs, a Watermark of t indicates that any new events will have a timestamp strictly greater than t. Unless you can completely rule out two events having the same timestamp, you are not safe to ever emit a Watermark of t. Enforcing distinct timestamps also limits the number of events per second a system can process to 1000 (given millisecond-resolution timestamps).
Is this really an issue in Apache Flink or is there a workaround?
For those of you who'd like a concrete example to play with, my use case is to build an hourly aggregated rolling word count for an event-time-ordered stream. For the data sample that I copied into a file (notice the duplicate 9):
mario 0
luigi 1
mario 2
mario 3
vilma 4
fred 5
bob 6
bob 7
mario 8
dan 9
dylan 9
dylan 11
fred 12
mario 13
mario 14
carl 15
bambam 16
summer 17
anna 18
anna 19
edu 20
anna 21
anna 22
anna 23
anna 24
anna 25
And the code:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
            .setParallelism(1)
            .setMaxParallelism(1);
    env.setStreamTimeCharacteristic(EventTime);

    String fileLocation = "full file path here";
    DataStreamSource<String> rawInput = env.readFile(new TextInputFormat(new Path(fileLocation)), fileLocation);

    rawInput.flatMap(parse())
        .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
            @Nullable
            @Override
            public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
                return new Watermark(extractedTimestamp);
            }

            @Override
            public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
                return element.getTimestamp();
            }
        })
        .keyBy(TimestampedWord::getWord)
        .process(new KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>>() {
            private transient ValueState<Long> count;

            @Override
            public void open(Configuration parameters) throws Exception {
                count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
            }

            @Override
            public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                if (count.value() == null) {
                    count.update(0L);
                    setTimer(ctx.timerService(), value.getTimestamp());
                }
                count.update(count.value() + 1);
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                long currentWatermark = ctx.timerService().currentWatermark();
                out.collect(new Tuple3(ctx.getCurrentKey(), count.value(), currentWatermark));
                if (currentWatermark < Long.MAX_VALUE) {
                    setTimer(ctx.timerService(), currentWatermark);
                }
            }

            private void setTimer(TimerService service, long t) {
                service.registerEventTimeTimer(((t / 10) + 1) * 10);
            }
        })
        .addSink(new PrintlnSink());

    env.execute();
}

private static FlatMapFunction<String, TimestampedWord> parse() {
    return new FlatMapFunction<String, TimestampedWord>() {
        @Override
        public void flatMap(String value, Collector<TimestampedWord> out) {
            String[] wordsAndTimes = value.split(" ");
            out.collect(new TimestampedWord(wordsAndTimes[0], Long.parseLong(wordsAndTimes[1])));
        }
    };
}

private static class TimestampedWord {
    private final String word;
    private final long timestamp;

    private TimestampedWord(String word, long timestamp) {
        this.word = word;
        this.timestamp = timestamp;
    }

    public String getWord() {
        return word;
    }

    public long getTimestamp() {
        return timestamp;
    }
}

private static class PrintlnSink implements org.apache.flink.streaming.api.functions.sink.SinkFunction<Tuple3<String, Long, Long>> {
    @Override
    public void invoke(Tuple3<String, Long, Long> value, Context context) throws Exception {
        long timestamp = value.getField(2);
        System.out.println(value.getField(0) + "=" + value.getField(1) + " at " + (timestamp - 10) + "-" + (timestamp - 1));
    }
}
I get
mario=4 at 1-10
dylan=2 at 1-10
luigi=1 at 1-10
fred=1 at 1-10
bob=2 at 1-10
vilma=1 at 1-10
dan=1 at 1-10
vilma=1 at 10-19
luigi=1 at 10-19
mario=6 at 10-19
carl=1 at 10-19
bambam=1 at 10-19
dylan=2 at 10-19
summer=1 at 10-19
anna=2 at 10-19
bob=2 at 10-19
fred=2 at 10-19
dan=1 at 10-19
fred=2 at 9223372036854775797-9223372036854775806
dan=1 at 9223372036854775797-9223372036854775806
carl=1 at 9223372036854775797-9223372036854775806
mario=6 at 9223372036854775797-9223372036854775806
vilma=1 at 9223372036854775797-9223372036854775806
edu=1 at 9223372036854775797-9223372036854775806
anna=7 at 9223372036854775797-9223372036854775806
summer=1 at 9223372036854775797-9223372036854775806
bambam=1 at 9223372036854775797-9223372036854775806
luigi=1 at 9223372036854775797-9223372036854775806
bob=2 at 9223372036854775797-9223372036854775806
dylan=2 at 9223372036854775797-9223372036854775806
Notice dylan=2 at 0-9 where it should be 1.
No, there isn't a problem with having stream elements with the same timestamp. But a Watermark is an assertion that all events that follow will have timestamps greater than the watermark, so this does mean that you cannot safely emit a Watermark t for a stream element at time t, unless the timestamps in the stream are strictly monotonically increasing -- which is not the case if there are multiple events with the same timestamp. This is why the AscendingTimestampExtractor produces watermarks equal to currentTimestamp - 1, and you should do the same.
Notice that your application is actually reporting dylan=2 at 1-10, not at 0-9. This is because the watermark resulting from "dylan 11" is triggering the first timer (the timer set for time 10; since there is no element with a timestamp of 10, that timer doesn't fire until the watermark from "dylan 11" arrives). And your PrintlnSink uses timestamp - 1 to indicate the upper end of the timespan, hence 11 - 1, or 10, rather than 9.
There's nothing wrong with the output of your ProcessFunction, which looks like this:
(mario,4,11)
(dylan,2,11)
(luigi,1,11)
(fred,1,11)
(bob,2,11)
(vilma,1,11)
(dan,1,11)
(vilma,1,20)
(luigi,1,20)
(mario,6,20)
(carl,1,20)
(bambam,1,20)
(dylan,2,20)
...
It is true that by time 11 there have been two dylans. But the report produced by PrintlnSink is misleading.
Two things need to be changed to get your example working as intended. First, the watermarks need to satisfy the watermarking contract, which isn't currently the case, and second, the windowing logic isn't quite right. The ProcessFunction needs to be prepared for the "dylan 11" event to arrive before the timer closing the window for 0-9 has fired. This is because the "dylan 11" stream element precedes the watermark generated from it in the stream.
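For the first fix, mirroring what AscendingTimestampExtractor does (as noted above), emit the watermark one millisecond behind the extracted timestamp:

@Override
public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
    // strictly less than the element's timestamp, so a later event with
    // the same timestamp cannot violate the watermark's guarantee
    return new Watermark(extractedTimestamp - 1);
}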
Update: events whose timestamps are beyond the current window (such as "dylan 11") can be handled as follows (a sketch is given after this list):
- keep track of when the current window ends
- rather than incrementing the counter, add events for times after the current window to a list
- after a window ends, consume the events from that list that fall into the next window
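A minimal sketch of those three steps, adapting the question's KeyedProcessFunction (the class and state names are illustrative; the window size of 10 and the cumulative rolling count follow the original code, the timer timestamp is emitted instead of the watermark so the sink reports the intended window, and it additionally needs ListState/ListStateDescriptor and java.util imports):

public static class BufferingCounter
        extends KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>> {

    private transient ValueState<Long> count;
    private transient ValueState<Long> windowEnd;          // step 1: end of the current window
    private transient ListState<TimestampedWord> pending;  // step 2: events beyond the current window

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
        windowEnd = getRuntimeContext().getState(new ValueStateDescriptor<>("windowEnd", Long.class));
        pending = getRuntimeContext().getListState(new ListStateDescriptor<>("pending", TimestampedWord.class));
    }

    @Override
    public void processElement(TimestampedWord value, Context ctx,
                               Collector<Tuple3<String, Long, Long>> out) throws Exception {
        if (windowEnd.value() == null) {
            count.update(0L);
            windowEnd.update(((value.getTimestamp() / 10) + 1) * 10);
            ctx.timerService().registerEventTimeTimer(windowEnd.value());
        }
        if (value.getTimestamp() < windowEnd.value()) {
            count.update(count.value() + 1);
        } else {
            pending.add(value); // park events that belong to a later window
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple3<String, Long, Long>> out) throws Exception {
        out.collect(new Tuple3<>(ctx.getCurrentKey(), count.value(), timestamp));
        windowEnd.update(timestamp + 10);

        // step 3: replay buffered events that now fall inside the new window
        long replayed = 0;
        List<TimestampedWord> stillPending = new ArrayList<>();
        for (TimestampedWord w : pending.get()) {
            if (w.getTimestamp() < windowEnd.value()) {
                replayed++;
            } else {
                stillPending.add(w);
            }
        }
        pending.update(stillPending);
        count.update(count.value() + replayed);

        if (ctx.timerService().currentWatermark() < Long.MAX_VALUE) {
            ctx.timerService().registerEventTimeTimer(windowEnd.value());
        }
    }
}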

The begin and end time of windows

What would be the way to show the begin and end time of windows? Something like implementing user-defined windows?
I would like to know the times at which windows begin and are evaluated, such that the output is
quantity(WindowAll Sum), window_start_time, window_end_time
12, 1:13:21, 1:13:41
6, 1:13:41, 1:15:01
Found the answer: TimeWindow has getStart() and getEnd().
Example usage:
public static class SumAllWindow implements AllWindowFunction<Tuple2<String, Integer>,
        Tuple3<Integer, String, String>, TimeWindow> {

    private static transient DateTimeFormatter timeFormatter =
            DateTimeFormat.forPattern("yyyy-MM-dd'T'HH:mm:ss.SS").withLocale(Locale.GERMAN)
                    .withZone(DateTimeZone.forID("Europe/Berlin"));

    @Override
    public void apply(TimeWindow window, Iterable<Tuple2<String, Integer>> values,
                      Collector<Tuple3<Integer, String, String>> out) throws Exception {
        DateTime startTs = new DateTime(window.getStart(), DateTimeZone.forID("Europe/Berlin"));
        DateTime endTs = new DateTime(window.getEnd(), DateTimeZone.forID("Europe/Berlin"));

        int sum = 0;
        for (Tuple2<String, Integer> value : values) {
            sum += value.f1;
        }
        out.collect(new Tuple3<>(sum, startTs.toString(timeFormatter), endTs.toString(timeFormatter)));
    }
}
In main():
msgStream.timeWindowAll(Time.of(6, TimeUnit.SECONDS)).apply(new SumAllWindow()).print();

How to avoid repeated tuples in Flink slide window join?

For example, there are two streams. One is advertisements shown to users; a tuple in it could be described as (advertiseId, showed timestamp). The other one is the click stream: (advertiseId, clicked timestamp). We want to get a joined stream which includes every advertisement that is clicked by a user within 20 minutes of being shown. My solution is to join these two streams on a SlidingTimeWindow, but the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Let Flink support joining two streams on separate windows, like Spark Streaming does. In this case, implement SlidingTimeWindows(21 mins, 1 min) on the advertisement stream and TumblingTimeWindows(1 min) on the click stream, then join these two windowed streams.
TumblingTimeWindows avoids duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoids missing legitimate clicks.
One issue is that there would be some invalid clicks (clicks more than 20 minutes after the ad was shown) in the joined stream. This can be fixed easily by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
        new MultiWindowsJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
        .window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
        .equalTo(keySelector)
        .window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
        .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            private static final long serialVersionUID = -3625150954096822268L;

            @Override
            public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                return new Tuple3<>(first.f0, first.f1, second.f1);
            }
        });

joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
    private static final long serialVersionUID = -4325256210808325338L;

    @Override
    public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
        return value.f1 < value.f2 && value.f1 + 20000 >= value.f2;
    }
});
Solution 2:
Flink supports join operations without windows. A join operator implementing the interface TwoInputStreamOperator keeps two time-length-based buffers, one per stream, and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
        .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -6564495005753073342L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(" ");
                return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
            }
        }).keyBy(keySelector).assignTimestamps(timestampExtractor1);

DataStream<Tuple2<String, Long>> click = env
        .addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -6564495005753073342L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(" ");
                return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
            }
        }).keyBy(keySelector).assignTimestamps(timestampExtractor2);

NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
        new NoWindowJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
        .where(keySelector)
        .buffer(Time.of(20, TimeUnit.SECONDS))
        .equalTo(keySelector)
        .buffer(Time.of(5, TimeUnit.SECONDS))
        .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            private static final long serialVersionUID = -5075871109025215769L;

            @Override
            public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
                return new Tuple3<>(first.f0, first.f1, second.f1);
            }
        });
I implemented two new join operators based on the Flink streaming API TwoInputTransformation. Please check Flink-stream-join; I will add more tests to this repository.
In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by only specifying the window size (the default slide is equal to the window size).
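For instance, a non-overlapping variant of the join from the question could look like this (a sketch using the same old-style API as the question; joinFunction stands in for whatever JoinFunction you apply):

stream1.join(stream2)
    .where(0)
    .equalTo(0)
    // only a size, no slide: consecutive windows do not overlap
    .window(TumblingTimeWindows.of(Time.of(30, TimeUnit.MINUTES)))
    .apply(joinFunction);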
While searching for a solution to the same problem, I found the "Interval Join" very useful; it does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
With this, no explicit window has to be defined; instead, an interval relative to each single element is used. Here, between(Time.milliseconds(-2), Time.milliseconds(1)) joins each orange element with the green elements whose timestamps lie at most 2 ms before and at most 1 ms after it (see the interval join figure in the Flink documentation).
