Does Apache Flink support multiple events with the same timestamp?

It seems that Apache Flink would not handle two events with the same timestamp well in certain scenarios.
According to the docs, a Watermark of t indicates that any new events will have a timestamp strictly greater than t. Unless you can completely rule out the possibility of two events having the same timestamp, you cannot safely emit a Watermark of t. Enforcing distinct timestamps also limits the number of events per second a system can process to 1000, since timestamps have millisecond resolution.
Is this really an issue in Apache Flink, or is there a workaround?
For those of you who'd like a concrete example to play with, my use case is building an hourly aggregated rolling word count over an event-time-ordered stream. For the data sample that I copied into a file (notice the duplicate timestamp 9):
mario 0
luigi 1
mario 2
mario 3
vilma 4
fred 5
bob 6
bob 7
mario 8
dan 9
dylan 9
dylan 11
fred 12
mario 13
mario 14
carl 15
bambam 16
summer 17
anna 18
anna 19
edu 20
anna 21
anna 22
anna 23
anna 24
anna 25
And the code:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
            .setParallelism(1)
            .setMaxParallelism(1);

    env.setStreamTimeCharacteristic(EventTime);

    String fileLocation = "full file path here";
    DataStreamSource<String> rawInput = env.readFile(new TextInputFormat(new Path(fileLocation)), fileLocation);

    rawInput.flatMap(parse())
            .assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
                @Nullable
                @Override
                public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
                    return new Watermark(extractedTimestamp);
                }

                @Override
                public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
                    return element.getTimestamp();
                }
            })
            .keyBy(TimestampedWord::getWord)
            .process(new KeyedProcessFunction<String, TimestampedWord, Tuple3<String, Long, Long>>() {
                private transient ValueState<Long> count;

                @Override
                public void open(Configuration parameters) throws Exception {
                    count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
                }

                @Override
                public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    if (count.value() == null) {
                        count.update(0L);
                        setTimer(ctx.timerService(), value.getTimestamp());
                    }
                    count.update(count.value() + 1);
                }

                @Override
                public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
                    long currentWatermark = ctx.timerService().currentWatermark();
                    out.collect(new Tuple3(ctx.getCurrentKey(), count.value(), currentWatermark));
                    if (currentWatermark < Long.MAX_VALUE) {
                        setTimer(ctx.timerService(), currentWatermark);
                    }
                }

                private void setTimer(TimerService service, long t) {
                    service.registerEventTimeTimer(((t / 10) + 1) * 10);
                }
            })
            .addSink(new PrintlnSink());

    env.execute();
}

private static FlatMapFunction<String, TimestampedWord> parse() {
    return new FlatMapFunction<String, TimestampedWord>() {
        @Override
        public void flatMap(String value, Collector<TimestampedWord> out) {
            String[] wordsAndTimes = value.split(" ");
            out.collect(new TimestampedWord(wordsAndTimes[0], Long.parseLong(wordsAndTimes[1])));
        }
    };
}

private static class TimestampedWord {
    private final String word;
    private final long timestamp;

    private TimestampedWord(String word, long timestamp) {
        this.word = word;
        this.timestamp = timestamp;
    }

    public String getWord() {
        return word;
    }

    public long getTimestamp() {
        return timestamp;
    }
}

private static class PrintlnSink implements org.apache.flink.streaming.api.functions.sink.SinkFunction<Tuple3<String, Long, Long>> {
    @Override
    public void invoke(Tuple3<String, Long, Long> value, Context context) throws Exception {
        long timestamp = value.getField(2);
        System.out.println(value.getField(0) + "=" + value.getField(1) + " at " + (timestamp - 10) + "-" + (timestamp - 1));
    }
}
I get
mario=4 at 1-10
dylan=2 at 1-10
luigi=1 at 1-10
fred=1 at 1-10
bob=2 at 1-10
vilma=1 at 1-10
dan=1 at 1-10
vilma=1 at 10-19
luigi=1 at 10-19
mario=6 at 10-19
carl=1 at 10-19
bambam=1 at 10-19
dylan=2 at 10-19
summer=1 at 10-19
anna=2 at 10-19
bob=2 at 10-19
fred=2 at 10-19
dan=1 at 10-19
fred=2 at 9223372036854775797-9223372036854775806
dan=1 at 9223372036854775797-9223372036854775806
carl=1 at 9223372036854775797-9223372036854775806
mario=6 at 9223372036854775797-9223372036854775806
vilma=1 at 9223372036854775797-9223372036854775806
edu=1 at 9223372036854775797-9223372036854775806
anna=7 at 9223372036854775797-9223372036854775806
summer=1 at 9223372036854775797-9223372036854775806
bambam=1 at 9223372036854775797-9223372036854775806
luigi=1 at 9223372036854775797-9223372036854775806
bob=2 at 9223372036854775797-9223372036854775806
dylan=2 at 9223372036854775797-9223372036854775806
Notice dylan=2 in the first window, where it should be 1.

No, there isn't a problem with having stream elements with the same timestamp. But a Watermark is an assertion that all events that follow will have timestamps greater than the watermark, so this does mean that you cannot safely emit a Watermark t for a stream element at time t, unless the timestamps in the stream are strictly monotonically increasing -- which is not the case if there are multiple events with the same timestamp. This is why the AscendingTimestampExtractor produces watermarks equal to currentTimestamp - 1, and you should do the same.
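For example, a minimal sketch of the question's punctuated assigner adjusted to satisfy the contract; only the watermark value changes:
.assignTimestampsAndWatermarks(new AssignerWithPunctuatedWatermarks<TimestampedWord>() {
    @Nullable
    @Override
    public Watermark checkAndGetNextWatermark(TimestampedWord lastElement, long extractedTimestamp) {
        // a watermark of t - 1 still permits later events with timestamp t
        return new Watermark(extractedTimestamp - 1);
    }

    @Override
    public long extractTimestamp(TimestampedWord element, long previousElementTimestamp) {
        return element.getTimestamp();
    }
})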
Notice that your application is actually reporting dylan=2 at 1-10, not at 0-9. This is because the watermark resulting from dylan at time 11 is triggering the first timer (the timer set for time 10; since there is no element with a timestamp of 10, that timer doesn't fire until the watermark from "dylan 11" arrives). And your PrintlnSink uses timestamp - 1 as the upper end of the timespan, hence 11 - 1 = 10, rather than 9.
There's nothing wrong with the output of your ProcessFunction, which looks like this:
(mario,4,11)
(dylan,2,11)
(luigi,1,11)
(fred,1,11)
(bob,2,11)
(vilma,1,11)
(dan,1,11)
(vilma,1,20)
(luigi,1,20)
(mario,6,20)
(carl,1,20)
(bambam,1,20)
(dylan,2,20)
...
It is true that by time 11 there have been two dylans. But the report produced by PrintlnSink is misleading.
Two things need to be changed to get your example working as intended. First, the watermarks need to satisfy the watermarking contract, which isn't currently the case. Second, the windowing logic isn't quite right: the ProcessFunction needs to be prepared for the "dylan 11" event to arrive before the timer closing the window for 0-9 has fired, because the "dylan 11" stream element precedes the watermark generated from it in the stream.
Update: events whose timestamps are beyond the current window (such as "dylan 11") can be handled as follows (a sketch is given after the list):
1. Keep track of when the current window ends.
2. Rather than incrementing the counter, add events for times after the current window to a list.
3. After a window ends, consume events from that list that fall into the next window.
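A minimal sketch of that approach inside the KeyedProcessFunction, reusing the types from the code above; the windowEnd and pending state names and the hard-coded window size of 10 are illustrative, not from the original:
// buffer events that belong to a future window instead of counting them now
private transient ValueState<Long> count;      // rolling total for this key
private transient ValueState<Long> windowEnd;  // end (exclusive) of the window being counted
private transient ListState<TimestampedWord> pending; // events beyond the current window

@Override
public void open(Configuration parameters) {
    count = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));
    windowEnd = getRuntimeContext().getState(new ValueStateDescriptor<>("windowEnd", Long.class));
    pending = getRuntimeContext().getListState(new ListStateDescriptor<>("pending", TimestampedWord.class));
}

@Override
public void processElement(TimestampedWord value, Context ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
    if (windowEnd.value() == null) {
        count.update(0L);
        windowEnd.update(((value.getTimestamp() / 10) + 1) * 10);
        ctx.timerService().registerEventTimeTimer(windowEnd.value());
    }
    if (value.getTimestamp() < windowEnd.value()) {
        count.update(count.value() + 1);  // belongs to the current window
    } else {
        pending.add(value);               // arrived early; defer to a later window
    }
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple3<String, Long, Long>> out) throws Exception {
    out.collect(new Tuple3<>(ctx.getCurrentKey(), count.value(), timestamp));
    windowEnd.update(timestamp + 10);
    // consume buffered events that now fall into the window just opened
    List<TimestampedWord> stillPending = new ArrayList<>();
    for (TimestampedWord w : pending.get()) {
        if (w.getTimestamp() < windowEnd.value()) {
            count.update(count.value() + 1);
        } else {
            stillPending.add(w);
        }
    }
    pending.update(stillPending);
    if (ctx.timerService().currentWatermark() < Long.MAX_VALUE) {
        ctx.timerService().registerEventTimeTimer(windowEnd.value());
    }
}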

Related

Flink Watermarks on Event Time

I'm trying to understand watermarks with event time.
My code is similar to the WordCount example in the Flink documentation.
I made some changes to include a timestamp on the event and added watermarks.
The event format is: word;timestamp
The map function creates a Tuple3 of word;1;timestamp.
Then a watermark strategy is assigned, with a timestamp assigner equal to the event timestamp field.
For the following stream events:
test;1662128808294
test;1662128818065
test;1662128822434
test;1662128826434
test;1662128831175
test;1662128836581
I got the following result: (test,6) => This is correct; I sent the word test 6 times.
But looking at the context in the ProcessFunction, I see the following:
Processing Time: Fri Sep 02 15:27:20 WEST 2022
Watermark: Fri Sep 02 15:26:56 WEST 2022
Start Window: 2022 09 02 15:26:40 End Window: 2022 09 02 15:27:20
The window is correct, it's a 40-second window as defined, and the watermark is also correct: it's 20 seconds less than the last event timestamp (1662128836581 = Friday, September 2, 2022 3:27:16), as defined in the watermark strategy.
My question is about the window's processing time. The window fired exactly at the end-of-window processing time, but shouldn't it wait until the watermark passes the end of the window (something like processing time = end of window + 20 seconds) (Window Default Trigger Docs)?
What am I doing wrong? Or do I have a bad understanding of watermarks?
My Code:
public class DataStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        WatermarkStrategy<Tuple3<String, Integer, Long>> strategy = WatermarkStrategy
                .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
                .withTimestampAssigner((event, timestamp) -> event.f2);

        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .map(new Splitter())
                .assignTimestampsAndWatermarks(strategy)
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
                .process(new MyProcessWindowFunction());

        dataStream.print();
        env.execute("Window WordCount");
    }

    public static class Splitter extends RichMapFunction<String, Tuple3<String, Integer, Long>> {
        @Override
        public Tuple3<String, Integer, Long> map(String value) throws Exception {
            String[] word = value.split(";");
            return new Tuple3<String, Integer, Long>(word[0], 1, Long.parseLong(word[1]));
        }
    }

    public static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow> {
        @Override
        public void process(String s, ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow>.Context context, Iterable<Tuple3<String, Integer, Long>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer sum = 0;
            for (Tuple3<String, Integer, Long> in : elements) {
                sum++;
            }
            out.collect(new Tuple2<String, Integer>(s, sum));

            Date date = new Date(context.window().getStart());
            Date date2 = new Date(context.window().getEnd());
            Date watermark = new Date(context.currentWatermark());
            Date processingTime = new Date(context.currentProcessingTime());
            System.out.println(context.currentWatermark());
            System.out.println("Processing Time: " + processingTime);
            Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss");
            System.out.println("Watermark: " + watermark);
            System.out.println("Start Window: " + format.format(date) + " End Window: " + format.format(date2));
        }
    }
}
Thanks.
You specified processing time windows, which fire based on the wall clock and ignore watermarks entirely; that is why your window fired exactly at the end-of-window processing time. To get event time windows, you need to change
.window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
to
.window(TumblingEventTimeWindows.of(Time.seconds(40)))
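With that change, the window fires only once the watermark passes the end of the window. A sketch of the corrected fragment, everything else unchanged:
DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9999)
        .map(new Splitter())
        .assignTimestampsAndWatermarks(strategy)
        .keyBy(value -> value.f0)
        // event-time windows fire when the watermark passes the window end,
        // i.e. once an event with timestamp >= windowEnd + 20s has arrived
        .window(TumblingEventTimeWindows.of(Time.seconds(40)))
        .process(new MyProcessWindowFunction());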

Change in Behavior for EventTimeSessionWindows from Flink 1.11.1 to 1.14.0

I observed what appears to be a change in behavior for EventTimeSessionWindows when upgrading from 1.11.1 to 1.14.0. This was identified in a unit test.
For a window with a defined time gap of 10 seconds.
Publish KEY_1 with eventtime 1 second
Publish KEY_1 with eventtime 3 seconds
Publish KEY_1 with eventtime 2 seconds
Publish KEY_2 with eventtime 101 seconds
For Flink 1.11.1, the window for KEY_1 closes, reduces, and publishes, presumably because KEY_2 has an event time more than 10 seconds after the last message in KEY_1's window. The KEY_2 window does not close. In the absence of KEY_2, the KEY_1 window would not close.
For Flink 1.14.0, the main difference is that the window for KEY_2 DOES close, even though there are no new messages after 111 seconds.
This appears to be a change in behavior. The nearest I could find was https://issues.apache.org/jira/browse/FLINK-20443 but that’s in 1.14.1. I also noticed https://issues.apache.org/jira/browse/FLINK-19777 which was in 1.11.3 but couldn't ascertain if that would have resulted in this behavior. Is there an explanation for this change in behavior? Is it expected or desirable? Is it because all pending windows are automatically closed based on an updated trigger behavior?
I tested the same behavior for ProcessingTimeSessionWindows and did not observe a similar change in behavior.
Thanks.
Jai
@Test
public void testEventTime() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // configure your test environment
    env.setParallelism(1);
    env.getConfig().registerTypeWithKryoSerializer(Document.class, ProtobufSerializer.class);

    // values are collected in a static variable
    CollectSink.values.clear();

    // create a stream of custom elements and apply transformations
    SingleOutputStreamOperator<Tuple2<String, Document>> inputStream = buildStream(env, this.generateTestOrders());
    SingleOutputStreamOperator<Tuple2<String, Document>> intermediateStream = this.documentDebounceFunction.insertIntoPipeline(inputStream);
    intermediateStream.addSink(new CollectSink());

    // execute
    env.execute();

    // verify your results
    Assertions.assertEquals(1, CollectSink.values.size());
    Map<String, Long> expectedVersions = Maps.newHashMap();
    expectedVersions.put(KEY_1, 2L);
    for (Tuple2<String, Document> actual : CollectSink.values) {
        Assertions.assertEquals(expectedVersions.get(actual.f0), actual.f1.getVersion());
    }
}

// create a testing sink
private static class CollectSink implements SinkFunction<Tuple2<String, Document>> {
    // must be static
    public static final List<Tuple2<String, Document>> values = Collections.synchronizedList(new ArrayList<>());

    @Override
    public void invoke(Tuple2<String, Document> value, SinkFunction.Context context) {
        values.add(value);
    }
}

public List<Tuple2<String, Document>> generateTestOrders() {
    List<Tuple2<String, Document>> testMessages = Lists.newArrayList();

    // KEY_1
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(1)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(1).build())
                .build()));
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(2)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(3).build())
                .build()));
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(3)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(2).build())
                .build()));

    // KEY_2 -- WAY IN THE FUTURE
    testMessages.add(
        Tuple2.of(
            KEY_2,
            Document.newBuilder()
                .setVersion(15)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(101).build())
                .build()));

    return ImmutableList.copyOf(testMessages);
}

private SingleOutputStreamOperator<Tuple2<String, Document>> buildStream(
        StreamExecutionEnvironment executionEnvironment,
        List<Tuple2<String, Document>> inputMessages) {
    inputMessages =
        inputMessages.stream()
            .sorted(
                Comparator.comparingInt(
                    msg -> (int) ProtobufTypeConversion.toMillis(msg.f1.getUpdatedAt())))
            .collect(Collectors.toList());

    WatermarkStrategy<Tuple2<String, Document>> watermarkStrategy =
        WatermarkStrategy.forMonotonousTimestamps();

    return executionEnvironment
        .fromCollection(
            inputMessages, TypeInformation.of(new TypeHint<Tuple2<String, Document>>() {}))
        .assignTimestampsAndWatermarks(
            watermarkStrategy.withTimestampAssigner(
                (event, timestamp) -> Timestamps.toMillis(event.f1.getUpdatedAt())));
}

Flink event session window why not emit

I'm losing my mind. It took me 10 hours, but it's still not working!!!
I use a Flink session window to join two streams, with event time, joining the two streams on a common value.
The code is as follows:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);

// current time
long currentTime = System.currentTimeMillis();
// time in milliseconds from which to start consuming the datahub topic
final long START_TIME_MS = 0L;
// session window gap
final Time WIN_GAP_TIME = Time.seconds(10L);
// source function maxOutOfOrderness in milliseconds
final Time maxOutOfOrderness = Time.milliseconds(5L);

// init source functions
DatahubSourceFunction oldTableASourceFun =
    new DatahubSourceFunction(
        endPoint,
        projectName,
        topicOldTableA,
        accessId,
        accessKey,
        // currentTime,
        START_TIME_MS,
        Long.MAX_VALUE,
        20,
        1000,
        20);

DatahubSourceFunction tableBSourceFun =
    new DatahubSourceFunction(
        endPoint,
        projectName,
        topicTableB,
        accessId,
        accessKey,
        START_TIME_MS,
        Long.MAX_VALUE,
        20,
        1000,
        20);

// init sources
DataStream<OldTableA> oldTableADataStream =
    env.addSource(oldTableASourceFun)
        .flatMap(
            new FlatMapFunction<List<RecordEntry>, OldTableA>() {
                @Override
                public void flatMap(List<RecordEntry> list, Collector<OldTableA> out)
                        throws Exception {
                    for (RecordEntry recordEntry : list) {
                        out.collect(CommonUtils.convertToOldTableA(recordEntry));
                    }
                }
            })
        .uid("oldTableADataSource")
        .setParallelism(1)
        .returns(new TypeHint<OldTableA>() {})
        .assignTimestampsAndWatermarks(new ExtractorWM<OldTableA>(maxOutOfOrderness));

DataStream<TableB> tableBDataStream =
    env.addSource(tableBSourceFun)
        .flatMap(
            new FlatMapFunction<List<RecordEntry>, TableB>() {
                @Override
                public void flatMap(java.util.List<RecordEntry> list, Collector<TableB> out)
                        throws Exception {
                    for (RecordEntry recordEntry : list) {
                        out.collect(CommonUtils.convertToTableB(recordEntry));
                    }
                }
            })
        .uid("tableBDataSource")
        .setParallelism(1)
        .returns(new TypeHint<TableB>() {})
        .assignTimestampsAndWatermarks(new ExtractorWM<TableB>(maxOutOfOrderness));
And the ExtractorWM code is as follows:
public class ExtractorWM<T extends CommonPOJO> extends BoundedOutOfOrdernessTimestampExtractor<T> {
    public ExtractorWM(Time maxOutOfOrderness) {
        super(maxOutOfOrderness);
    }

    @Override
    public long extractTimestamp(T element) {
        /* this prints fine:
           System.out.println(element + "-" + CommonUtils.getSimpleDateFormat().format(getCurrentWatermark().getTimestamp())); */
        return System.currentTimeMillis();
    }
}
I tested that the above code produces output correctly; the events and watermark timestamps look right:
// oldTableADataStream events and watermark timestamps
OldTableA{PA1=1, a2='a20', **fa3=20**, fa4=30} 1596092987721
OldTableA{PA1=2, a2='a20', **fa3=20**, fa4=31} 1596092987721
OldTableA{PA1=3, a2='a20', **fa3=20**, fa4=32} 1596092987721
OldTableA{PA1=4, a2='a20', **fa3=20**, fa4=33} 1596092987721
OldTableA{PA1=5, a2='a20', **fa3=20**, fa4=34} 1596092987722
// tableBDataStream events and watermark timestamps
TableB{**PB1=20**, B2='b20', B3='b30'} 1596092987721
TableB{PB1=21, B2='b21', B3='b31'} 1596092987721
TableB{PB1=22, B2='b22', B3='b32'} 1596092987721
TableB{PB1=23, B2='b23', B3='b33'} 1596092987722
TableB{PB1=24, B2='b24', B3='b34'} 1596092987722
I expect results like:
1 a20 20 b20 b30 30
2 a20 20 b20 b30 31
3 a20 20 b20 b30 32
4 a20 20 b20 b30 33
5 a20 20 b20 b30 34
6 a20 20 b20 b30 35
but the join operator does not work:
DataStream<NewTableA> join1 =
    oldTableADataStream
        .join(tableBDataStream)
        .where(t1 -> t1.getFa3())   // printing elements here works
        .equalTo(t2 -> t2.getPb1()) // printing elements here works
        .window(EventTimeSessionWindows.withGap(WIN_GAP_TIME))
        // .trigger(new TestTrigger())
        // .allowedLateness(Time.seconds(2))
        .apply(new oldTableAJoinTableBFunc()); // the join method is never called

join1.print(); // prints nothing
And the oldTableAJoinTableBFunc code is as follows:
public class oldTableAJoinTableBFunc implements JoinFunction<OldTableA, TableB, NewTableA> {
    @Override
    public NewTableA join(OldTableA oldTableA, TableB tableB) throws Exception {
        // not working -- I set a breakpoint on this line and debugged,
        // but it is never hit
        System.out.println(
            oldTableA
                + " - "
                + tableB
                + " - "
                + CommonUtils.getSimpleDateFormat().format(System.currentTimeMillis()));
        NewTableA newTableA = new NewTableA();
        newTableA.setPA1(oldTableA.getPa1());
        newTableA.setA2(oldTableA.getA2());
        newTableA.setFA3(oldTableA.getFa3());
        newTableA.setFA4(oldTableA.getFa4());
        newTableA.setB2(tableB.getB2());
        newTableA.setB3(tableB.getB3());
        return newTableA;
    }
}
The problem I see is that in apply(new oldTableAJoinTableBFunc()) the join method is never called: I set a breakpoint on it and debugged, but it never breaks. From studying the source code, join should be called whenever a matching pair occurs. I printed t1.getFa3() and t2.getPb1(), and at least one pair (20) is equal, so why is join never called?
Your approach to handling time and watermarking is why this isn't working. For a more in-depth introduction to the topic, see the section of the Flink training course that covers event time and watermarking.
The motivation behind event time processing is to be able to implement consistent, deterministic streaming analytics despite events arriving out of order and being processed at some unknown rate. This depends on the events carrying timestamps, and on those timestamps being extracted from the events by a timestamp assigner. Your timestamp assigner returns System.currentTimeMillis(), which effectively disables all of the event time machinery. Moreover, because you are using System.currentTimeMillis() as the source of timing information, your events cannot be out of order, yet you are specifying a watermarking delay of 5 msec.
I doubt your job even runs for 5 msec, so it may not be generating any watermarks at all. Plus, it will actually take 200 msec before Flink sends the first watermark (see below).
For a session to end, there will have to be a 10 second interval during which no events are processed. (If you switch to using proper event time timestamps, then those timestamps will need a gap of 10+ seconds, but since you are using System.currentTimeMillis as the source of timing info, your job needs a gap of 10 real-time seconds to close a session.)
A BoundedOutOfOrdernessTimestampExtractor generates watermarks by observing the timestamps in the stream; every 200 msec (by default) it injects a Watermark into the stream whose value is computed by taking the largest timestamp seen so far in the event stream and subtracting the bounded delay (5 msec). A 10-second event time session window will only close when a Watermark arrives that is at least 10 seconds later than the timestamp of the latest event currently in the session. For such a watermark to be created, a suitable event with a sufficiently large timestamp has to be processed.
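For comparison, a sketch (not the original code) of an extractor that uses the event's own timestamp, assuming a hypothetical getEventTime() accessor returning epoch milliseconds:
public class EventTimeExtractor extends BoundedOutOfOrdernessTimestampExtractor<OldTableA> {
    public EventTimeExtractor(Time maxOutOfOrderness) {
        super(maxOutOfOrderness);
    }

    @Override
    public long extractTimestamp(OldTableA element) {
        // use the time carried by the event, not the wall clock;
        // getEventTime() is a hypothetical accessor, not in the original POJO
        return element.getEventTime();
    }
}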
I found the answer!
The reason is the DatahubSourceFunction, which consumes a datahub topic (an Aliyun service similar to Kafka) and emits records into Flink. When no records are consumed, the watermark timestamp stays the same over and over again.
I use BoundedOutOfOrdernessTimestampExtractor to generate watermarks, and it extracts the watermark timestamp from the events. So a watermark is only generated when there is an event, and the watermark timestamp keeps the same value while there are no events.
When the DatahubSourceFunction consumes the last record and emits the last event, no more events are emitted.
The timestamp of the last watermark then equals the timestamp of the last event (System.currentTimeMillis()), and the session window never ends, because every watermark timestamp stays below the timestamp of the last event plus the window gap.
Since the session window never ends, the join function is never called.
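One way to work around the stalled watermark, sketched under the assumption that wall-clock timestamps are acceptable here (they are what the job already uses), is a periodic assigner whose watermark keeps advancing with processing time even when no events arrive, so the session gap can eventually be exceeded:
public class WallClockAdvancingAssigner<T extends CommonPOJO>
        implements AssignerWithPeriodicWatermarks<T> {

    private final long maxOutOfOrdernessMillis;

    public WallClockAdvancingAssigner(Time maxOutOfOrderness) {
        this.maxOutOfOrdernessMillis = maxOutOfOrderness.toMilliseconds();
    }

    @Override
    public long extractTimestamp(T element, long previousElementTimestamp) {
        return System.currentTimeMillis(); // as in the original job
    }

    @Nullable
    @Override
    public Watermark getCurrentWatermark() {
        // called periodically (every 200 msec by default), so the watermark
        // keeps moving even after the source stops emitting events
        return new Watermark(System.currentTimeMillis() - maxOutOfOrdernessMillis);
    }
}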

How to create batch or slide windows using Flink CEP?

I'm just starting with Flink CEP and I come from the Esper CEP engine. As you may (or may not) know, in Esper, using its syntax (EPL), you can easily create a batch or sliding window, grouping the events in those windows and letting you use these events with functions (avg, max, min, ...).
For example, with the following pattern you can create a batch window of 5 seconds and calculate the average value of the price attribute of all the Stock events received in that window.
select avg(price) from Stock#time_batch(5 sec)
The thing is, I would like to know how to implement this in Flink CEP. I'm aware that the goal or approach in Flink CEP is probably different, so the way to implement this may not be as simple as in Esper CEP.
I have taken a look at the docs regarding time windows, but I'm not able to implement these windows along with Flink CEP. So, given the following code:
DataStream<Stock> stream = ...; // Consume events from Kafka

// Filtering events with negative price
Pattern<Stock, ?> pattern = Pattern.<Stock>begin("start")
    .where(
        new SimpleCondition<Stock>() {
            public boolean filter(Stock event) {
                return event.getPrice() >= 0;
            }
        }
    );

PatternStream<Stock> patternStream = CEP.pattern(stream, pattern);

/**
 CREATE A BATCH WINDOW OF 5 SECONDS IN WHICH
 I COMPUTE OVER THE AVERAGE PRICES AND, IF IT IS
 GREATER THAN A THRESHOLD, AN ALERT IS DETECTED

 return avg(allEventsInWindow.getPrice()) > 1;
 */

DataStream<Alert> result = patternStream.select(
    new PatternSelectFunction<Stock, Alert>() {
        @Override
        public Alert select(Map<String, List<Stock>> pattern) throws Exception {
            return new Alert(pattern.toString());
        }
    }
);
How can I create a window in which, starting from the first event received, I calculate the average over the following events within 5 seconds? For example:
t = 0 seconds
Stock(price = 1); (...starting batch window...)
Stock(price = 1);
Stock(price = 1);
Stock(price = 2);
Stock(price = 2);
Stock(price = 2);
t = 5 seconds (...end of batch window...)
Avg = 1.5 => Alert detected!
The average after 5 seconds would be 1.5, which would trigger the alert. How can I code this?
Thanks!
With Flink's CEP library this behavior is not expressible. I would rather recommend using Flink's DataStream or Table API to calculate the averages. Based on that you could again use CEP to generate other events.
final DataStream<Stock> input = env
    .fromElements(
        new Stock(1L, 1.0),
        new Stock(2L, 2.0),
        new Stock(3L, 1.0),
        new Stock(4L, 2.0))
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Stock>(Time.seconds(0L)) {
        @Override
        public long extractTimestamp(Stock element) {
            return element.getTimestamp();
        }
    });

final DataStream<Double> windowAggregation = input
    .timeWindowAll(Time.milliseconds(2))
    .aggregate(new AggregateFunction<Stock, Tuple2<Integer, Double>, Double>() {
        @Override
        public Tuple2<Integer, Double> createAccumulator() {
            return Tuple2.of(0, 0.0);
        }

        @Override
        public Tuple2<Integer, Double> add(Stock value, Tuple2<Integer, Double> accumulator) {
            return Tuple2.of(accumulator.f0 + 1, accumulator.f1 + value.getValue());
        }

        @Override
        public Double getResult(Tuple2<Integer, Double> accumulator) {
            return accumulator.f1 / accumulator.f0;
        }

        @Override
        public Tuple2<Integer, Double> merge(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
            return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
        }
    });

final DataStream<Double> result = windowAggregation.filter((FilterFunction<Double>) value -> value > THRESHOLD);
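Following the suggestion above, a sketch of feeding the averages back into CEP to raise alerts; Alert and THRESHOLD are assumed to exist, as in the question:
// a one-element pattern that matches any average above the threshold
Pattern<Double, ?> alertPattern = Pattern.<Double>begin("overThreshold")
    .where(new SimpleCondition<Double>() {
        @Override
        public boolean filter(Double average) {
            return average > THRESHOLD;
        }
    });

DataStream<Alert> alerts = CEP.pattern(windowAggregation, alertPattern)
    .select(new PatternSelectFunction<Double, Alert>() {
        @Override
        public Alert select(Map<String, List<Double>> pattern) {
            return new Alert("average price " + pattern.get("overThreshold").get(0));
        }
    });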

How to avoid repeated tuples in Flink slide window join?

For example, there are two streams. One is advertisements shown to users, where each tuple could be described as (advertiseId, showed timestamp). The other is the click stream: (advertiseId, clicked timestamp). We want to get a joined stream which includes all the advertisements that are clicked by a user within 20 minutes after being shown. My solution is to join these two streams on a SlidingTimeWindow. But the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Let Flink support joining two streams on separate windows, as Spark Streaming does. In this case, apply SlidingTimeWindows(21 mins, 1 min) on the advertisement stream and TumblingTimeWindows(1 min) on the click stream, then join the two windowed streams.
The TumblingTimeWindows avoid duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoid missing legitimate clicks.
One issue is that there would be some invalid clicks (clicks after 20 minutes) in the joined stream. This can be fixed easily by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
    new MultiWindowsJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
    .window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
    .equalTo(keySelector)
    .window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
    .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        private static final long serialVersionUID = -3625150954096822268L;

        @Override
        public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
            return new Tuple3<>(first.f0, first.f1, second.f1);
        }
    });

joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
    private static final long serialVersionUID = -4325256210808325338L;

    @Override
    public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
        return value.f1 < value.f2 && value.f1 + 20000 >= value.f2;
    }
});
Solution 2:
Flink supports join operations without windows. A join operator implementing the TwoInputStreamOperator interface keeps two time-based buffers of the two streams and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
    .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
    .map(new MapFunction<String, Tuple2<String, Long>>() {
        private static final long serialVersionUID = -6564495005753073342L;

        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
            String[] splits = value.split(" ");
            return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
        }
    }).keyBy(keySelector).assignTimestamps(timestampExtractor1);

DataStream<Tuple2<String, Long>> click = env
    .addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
    .map(new MapFunction<String, Tuple2<String, Long>>() {
        private static final long serialVersionUID = -6564495005753073342L;

        @Override
        public Tuple2<String, Long> map(String value) throws Exception {
            String[] splits = value.split(" ");
            return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
        }
    }).keyBy(keySelector).assignTimestamps(timestampExtractor2);

NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
    new NoWindowJoinedStreams<>(advertisement, click);

DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
    .where(keySelector)
    .buffer(Time.of(20, TimeUnit.SECONDS))
    .equalTo(keySelector)
    .buffer(Time.of(5, TimeUnit.SECONDS))
    .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        private static final long serialVersionUID = -5075871109025215769L;

        @Override
        public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
            return new Tuple3<>(first.f0, first.f1, second.f1);
        }
    });
I implemented two new join operators based on the Flink streaming API TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by specifying only the window size (the default slide is equal to the window size).
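For instance, a minimal sketch of the non-overlapping variant of the join above (the 20-minute size is an illustrative choice):
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(TumblingTimeWindows.of(Time.of(20, TimeUnit.MINUTES)))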
While searching for a solution to the same problem, I found the "Interval Join" very useful; it does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...

orangeStream
    .keyBy(<KeySelector>)
    .intervalJoin(greenStream.keyBy(<KeySelector>))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction<Integer, Integer, String>() {
        @Override
        public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
            out.collect(left + "," + right);
        }
    });
With this, no explicit window has to be defined; instead, an interval is applied to each single element (see the interval join figure in the Flink documentation).
