Find stream of events that are not grouped using CoGroupFunction - apache-flink

How can we find stream of events that are not matched with other events, when using CoGroupFunction?
Let's consider people communicating over phone calls. In Tuple2<String, Integer>, f0 is the name of the person and f1 is the phone number they are calling, or receiving a call from.
We have paired them using coGroup, but we are missing the people whose call has no counterpart in the other stream (calls from or to someone outside our dataset).
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
DataStream<Tuple2<String, Integer>> callers = env.fromElements(
new Tuple2<String, Integer>("alice->", 12), // alice dials 12
new Tuple2<String, Integer>("bob->", 13), // bob dials 13
new Tuple2<String, Integer>("charlie->", 19))
.assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));
DataStream<Tuple2<String, Integer>> callees = env.fromElements(
new Tuple2<String, Integer>("->carl", 12), // carl received call
new Tuple2<String, Integer>("->ted", 13),
new Tuple2<String, Integer>("->chris", 7))
.assignTimestampsAndWatermarks(new TimestampExtractor(Time.seconds(5)));
DataStream<Tuple1<String>> groupedStream = callers.coGroup(callees)
.where(evt -> evt.f1).equalTo(evt -> evt.f1)
.window(TumblingEventTimeWindows.of(Time.seconds(10)))
.apply(new IntEqualCoGroupFunc());
groupedStream.print(); // prints 1> (alice->-->carl) \n 1> (bob->-->ted)
//DataStream<Tuple1<String>> notGroupedStream = ..; // people without pairs in last window
//notGroupedStream.print(); // should print charlie->-->someone \n someone->-->chris
env.execute();

To be honest, the simplest solution seems to be changing IntEqualCoGroupFunc, so that instead of String it returns (Boolean, String).
This is because coGroup also processes those elements that have no matching key; such elements arrive with one of the Iterables empty in coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<O> out). In your case, ("->chris", 7) would arrive as second with an empty Iterable as first, since the callees stream is the second input of the coGroup.
Changing the signature would allow you to also emit the results that have no matching key, and simply split them into separate streams at a later stage of processing (see the sketch after the output below).
// Implementation of IntEqualCoGroupFunc
@Override
public void coGroup(Iterable<Tuple2<String, Integer>> outbound, Iterable<Tuple2<String, Integer>> inbound,
Collector<Tuple1<String>> out) throws Exception {
for (Tuple2<String, Integer> outboundObj : outbound) {
for (Tuple2<String, Integer> inboundObj : inbound) {
out.collect(Tuple1.of(outboundObj.f0 + "-" + inboundObj.f0)); //matching pair
return;
}
out.collect(Tuple1.of(outboundObj.f0 + "->someone")); //inbound is empty
return;
}
// outbound is empty
for (Tuple2<String, Integer> inboundObj : inbound) {
out.collect(Tuple1.of("someone->-" + inboundObj.f0));
return;
}
//inbound also empty
out.collect(Tuple1.of("someone->-->someone"));
}
Output as follows:
2> (someone->-->chris)
2> (charlie->->someone)
1> (alice->-->carl)
1> (bob->-->ted)
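
If you go with the (Boolean, String) signature, a minimal sketch could look as follows (the flag marks whether the element had a match; the filter-based split afterwards is my assumption, not code from the question):

// Variant of IntEqualCoGroupFunc emitting Tuple2<Boolean, String>,
// where f0 marks whether the element had a matching pair
@Override
public void coGroup(Iterable<Tuple2<String, Integer>> outbound, Iterable<Tuple2<String, Integer>> inbound,
        Collector<Tuple2<Boolean, String>> out) throws Exception {
    for (Tuple2<String, Integer> outboundObj : outbound) {
        for (Tuple2<String, Integer> inboundObj : inbound) {
            out.collect(Tuple2.of(true, outboundObj.f0 + "-" + inboundObj.f0)); // matched pair
            return;
        }
        out.collect(Tuple2.of(false, outboundObj.f0 + "->someone")); // inbound empty
        return;
    }
    for (Tuple2<String, Integer> inboundObj : inbound) {
        out.collect(Tuple2.of(false, "someone->-" + inboundObj.f0)); // outbound empty
        return;
    }
    out.collect(Tuple2.of(false, "someone->-->someone")); // kept for parity with the original
}

// Splitting into matched and unmatched streams afterwards:
DataStream<Tuple2<Boolean, String>> all = callers.coGroup(callees)
        .where(evt -> evt.f1).equalTo(evt -> evt.f1)
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .apply(new IntEqualCoGroupFunc()); // assuming the modified signature above
DataStream<Tuple2<Boolean, String>> groupedStream = all.filter(t -> t.f0);
DataStream<Tuple2<Boolean, String>> notGroupedStream = all.filter(t -> !t.f0);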

Related

Flink Watermarks on Event Time

I'm trying to understand watermarks with event time.
My code is similar to the WordCount example in the Flink documentation.
I made some changes to include a timestamp on the event and added watermarks.
The event format is: word;timestamp
The map function creates a Tuple3 of (word, 1, timestamp).
Then a watermark strategy is assigned, with a timestamp assigner equal to the event timestamp field.
For the following stream events:
test;1662128808294
test;1662128818065
test;1662128822434
test;1662128826434
test;1662128831175
test;1662128836581
I got the following result: (test,6) => This is correct, I sent the word test 6 times.
But looking at the context in the ProcessWindowFunction, I see the following:
Processing Time: Fri Sep 02 15:27:20 WEST 2022
Watermark: Fri Sep 02 15:26:56 WEST 2022
Start Window: 2022 09 02 15:26:40 End Window: 2022 09 02 15:27:20
The window is correct: it's a 40-second window as defined. The watermark is also correct: it's 20 seconds behind the last event timestamp (1662128836581 = Friday, September 2, 2022 3:27:16), as defined in the watermark strategy.
My question is about the window's processing time. The window fired exactly at the window end in processing time, but shouldn't it wait until the watermark passes the end of the window (something like processing time = end of window + 20 seconds), per the window default trigger docs?
What am I doing wrong? Or do I have a bad understanding of watermarks?
My Code:
public class DataStreamJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
WatermarkStrategy<Tuple3<String, Integer, Long>> strategy = WatermarkStrategy
.<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
.withTimestampAssigner((event, timestamp) -> event.f2);
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9999)
.map(new Splitter())
.assignTimestampsAndWatermarks(strategy)
.keyBy(value -> value.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
.process(new MyProcessWindowFunction());
dataStream.print();
env.execute("Window WordCount");
}
public static class Splitter extends RichMapFunction<String, Tuple3<String, Integer, Long>> {
@Override
public Tuple3<String, Integer, Long> map(String value) throws Exception {
String[] word = value.split(";");
return new Tuple3<String, Integer, Long>(word[0], 1, Long.parseLong(word[1]));
}
}
public static class MyProcessWindowFunction extends ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow> {
@Override
public void process(String s, ProcessWindowFunction<Tuple3<String, Integer, Long>, Tuple2<String, Integer>, String, TimeWindow>.Context context, Iterable<Tuple3<String, Integer, Long>> elements, Collector<Tuple2<String, Integer>> out) throws Exception {
Integer sum = 0;
for (Tuple3<String, Integer, Long> in : elements) {
sum++;
}
out.collect(new Tuple2<String, Integer>(s, sum));
Date date = new Date(context.window().getStart());
Date date2 = new Date(context.window().getEnd());
Date watermark = new Date(context.currentWatermark());
Date processingTime = new Date(context.currentProcessingTime());
System.out.println(context.currentWatermark());
System.out.println("Processing Time: " + processingTime);
Format format = new SimpleDateFormat("yyyy MM dd HH:mm:ss");
System.out.println("Watermark: " + watermark);
System.out.println("Start Window: " + format.format(date) + " End Window: " + format.format(date2));
}
}
}
Thanks.
To get event time windows, you need to change
.window(TumblingProcessingTimeWindows.of(Time.seconds(40)))
to
.window(TumblingEventTimeWindows.of(Time.seconds(40)))
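With TumblingProcessingTimeWindows, the default trigger fires based on the wall clock, which is why your window closed exactly at its end in processing time and ignored the watermark; event-time windows fire only once the watermark passes the window end. For reference, the windowing step of your pipeline then becomes:

DataStream<Tuple2<String, Integer>> dataStream = env
        .socketTextStream("localhost", 9999)
        .map(new Splitter())
        .assignTimestampsAndWatermarks(strategy)
        .keyBy(value -> value.f0)
        // event-time window: fires when the watermark passes the window end
        .window(TumblingEventTimeWindows.of(Time.seconds(40)))
        .process(new MyProcessWindowFunction());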

How to print an aggregated DataStream in flink?

I have a custom state calculation that is represented as a Set<Long>, and it keeps getting updated as my DataStream<Set<Long>> sees new events from Kafka. Now, every time my state is updated, I want to print the updated state to stdout. I'm wondering how to do that in Flink. I'm a little confused by all the window and trigger operations, and I keep getting the following error.
Caused by: java.lang.RuntimeException: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
I just want to know how to print my aggregated stream DataStream<Set<Long>> to stdout or write it back to another Kafka topic.
Below is the snippet of the code that throws the error.
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(env, bsSettings);
DataStream<Set<Long>> stream = bsTableEnv.toAppendStream(kafkaSourceTable, Row.class)
stream
.aggregate(new MyCustomAggregation(100))
.process(new ProcessFunction<Set<Long>, Object>() {
@Override
public void processElement(Set<Long> value, Context ctx, Collector<Object> out) throws Exception {
System.out.println(value.toString());
}
});
Keeping collections in state with Flink can be very expensive, because in some cases the collection will be frequently serialized and deserialized. When possible, it is preferable to use Flink's built-in ListState and MapState types.
Here's an example illustrating a few things:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements(1L, 2L, 3L, 4L, 3L, 2L, 1L, 0L)
.keyBy(x -> 1)
.process(new KeyedProcessFunction<Integer, Long, List<Long>> () {
private transient MapState<Long, Boolean> set;
@Override
public void open(Configuration parameters) throws Exception {
set = getRuntimeContext().getMapState(new MapStateDescriptor<>("set", Long.class, Boolean.class));
}
@Override
public void processElement(Long x, Context context, Collector<List<Long>> out) throws Exception {
if (set.contains(x)) {
System.out.println("set contains " + x);
} else {
set.put(x, true);
List<Long> list = new ArrayList<>();
Iterator<Long> iter = set.keys().iterator();
iter.forEachRemaining(list::add);
out.collect(list);
}
}
})
.print();
env.execute();
}
Note that I wanted to use keyed state, but didn't have anything in the events to use as a key, so I just keyed the stream by a constant. This is normally not a good idea, as it prevents the processing from being done in parallel -- but since you are aggregating as a Set, that's not something you can do in parallel, so no harm done.
I'm representing the set of Longs as the keys of a MapState object. And when I want to output the set, I collect it as a List. When I just want to print something for debugging, I just use System.out.
What I see when I run this job in my IDE is this:
[1]
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]
set contains 3
set contains 2
set contains 1
[0, 1, 2, 3, 4]
If you'd rather see what's in the MapState every second, you can use a Timer in the process function.
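A rough sketch of that variant, keeping the rest of the example as-is (the one-second period and the extra ValueState flag used to arm the first timer are my assumptions):

// Inside the KeyedProcessFunction: dump the MapState keys once per second
private transient ValueState<Boolean> timerArmed;

@Override
public void processElement(Long x, Context ctx, Collector<List<Long>> out) throws Exception {
    set.put(x, true);
    if (timerArmed.value() == null) { // first element for this key: start the periodic timer
        ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 1000);
        timerArmed.update(true);
    }
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<Long>> out) throws Exception {
    List<Long> list = new ArrayList<>();
    set.keys().iterator().forEachRemaining(list::add);
    out.collect(list); // emit the current contents of the set
    ctx.timerService().registerProcessingTimeTimer(timestamp + 1000); // re-arm
}

(timerArmed would be initialized in open() alongside set, e.g. via getRuntimeContext().getState(new ValueStateDescriptor<>("timerArmed", Boolean.class)).)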

How to create batch or slide windows using Flink CEP?

I'm just starting with Flink CEP, and I come from the Esper CEP engine. As you may (or may not) know, in Esper, using its syntax (EPL), you can easily create a batch or sliding window, grouping the events in those windows and allowing you to apply functions to them (avg, max, min, ...).
For example, with the following pattern you can create a 5-second batch window and calculate the average value of the price attribute of all the Stock events received in that window.
select avg(price) from Stock#time_batch(5 sec)
The thing is, I would like to know how to implement this in Flink CEP. I'm aware that the goal or approach in Flink CEP is probably different, so the way to implement this may not be as simple as in Esper CEP.
I have taken a look at the docs on time windows, but I'm not able to combine these windows with Flink CEP. So, given the following code:
DataStream<Stock> stream = ...; // Consume events from Kafka
// Filtering events with negative price
Pattern<Stock, ?> pattern = Pattern.<Stock>begin("start")
.where(
new SimpleCondition<Stock>() {
public boolean filter(Stock event) {
return event.getPrice() >= 0;
}
}
);
PatternStream<Stock> patternStream = CEP.pattern(stream, pattern);
/**
CREATE A BATCH WINDOW OF 5 SECONDS IN WHICH
I COMPUTE THE AVERAGE PRICE AND, IF IT IS
GREATER THAN A THRESHOLD, AN ALERT IS DETECTED
return avg(allEventsInWindow.getPrice()) > 1;
*/
DataStream<Alert> result = patternStream.select(
new PatternSelectFunction<Stock, Alert>() {
@Override
public Alert select(Map<String, List<Stock>> pattern) throws Exception {
return new Alert(pattern.toString());
}
}
);
How can I create that window in which, starting from the first event received, I calculate the average of the events arriving within the next 5 seconds? For example:
t = 0 seconds
Stock(price = 1); (...starting batch window...)
Stock(price = 1);
Stock(price = 1);
Stock(price = 2);
Stock(price = 2);
Stock(price = 2);
t = 5 seconds (...end of batch window...)
Avg = 1.5 => Alert detected!
The average after 5 seconds would be 1.5, which would trigger the alert. How can I code this?
Thanks!
With Flink's CEP library this behavior is not expressible. I would rather recommend using Flink's DataStream or Table API to calculate the averages. Based on that you could again use CEP to generate other events.
final DataStream<Stock> input = env
.fromElements(
new Stock(1L, 1.0),
new Stock(2L, 2.0),
new Stock(3L, 1.0),
new Stock(4L, 2.0))
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Stock>(Time.seconds(0L)) {
@Override
public long extractTimestamp(Stock element) {
return element.getTimestamp();
}
});
final DataStream<Double> windowAggregation = input
.timeWindowAll(Time.milliseconds(2))
.aggregate(new AggregateFunction<Stock, Tuple2<Integer, Double>, Double>() {
@Override
public Tuple2<Integer, Double> createAccumulator() {
return Tuple2.of(0, 0.0);
}
@Override
public Tuple2<Integer, Double> add(Stock value, Tuple2<Integer, Double> accumulator) {
return Tuple2.of(accumulator.f0 + 1, accumulator.f1 + value.getValue());
}
@Override
public Double getResult(Tuple2<Integer, Double> accumulator) {
return accumulator.f1 / accumulator.f0;
}
@Override
public Tuple2<Integer, Double> merge(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
}
});
final DataStream<Double> result = windowAggregation.filter((FilterFunction<Double>) value -> value > THRESHOLD);
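Based on that, turning the threshold crossings back into events for CEP or downstream consumers could look roughly like this (a sketch; it reuses the Alert(String) constructor from the question's code):

// Convert each above-threshold average into an Alert event
final DataStream<Alert> alerts = result
        .map(avg -> new Alert("average price " + avg + " exceeded threshold"));
// 'alerts' could now be fed into CEP.pattern(...) again, or written to a sink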

Can I print individual elements of DataStream<T> in Apache Flink without using the inbuilt print() function

I am trying to print the values of the warnings that have been detected in Flink:
// Generate temperature warnings for each matched warning pattern
DataStream<TemperatureEvent> warnings = tempPatternStream.select(
(Map<String, MonitoringEvent> pattern) -> {
TemperatureEvent first = (TemperatureEvent) pattern.get("first");
return new TemperatureEvent(first.getRackID(), first.getTemperature()) ;
}
);
// Print the warning and alert events to stdout
warnings.print();
I am getting output as below (as per the toString of the event source function):
Rack id = 99 and temprature = 76.0
Can someone tell me if there is any way I can print the values of a DataStream without using print()? For example, if I only want to print the temperature, how can I access individual elements in the DataStream?
Thanks in advance.
I have figured out a way to access individual elements. Let's assume we have a DataStream of
HeartRate<Integer,Integer>
It has 2 attributes
private Integer Patient_id ;
private Integer HR;
// Generating a DataStream using a custom source function
DataStream<HREvent> hrEventDataStream = envrionment
.addSource(new HRGenerator()).assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
Assuming that you have generated a DataStream using a custom function, we can now print the values of individual elements of the HREvent as below:
hrEventDataStream.keyBy(new KeySelector<HREvent, Integer>() {
@Override
public Integer getKey(HREvent hrEvent) throws Exception {
return hrEvent.getPatient_id();
}
})
.window(TumblingEventTimeWindows.of(milliseconds(10)))
.apply(new WindowFunction<HREvent, Object, Integer, TimeWindow>() {
@Override
public void apply(Integer integer, TimeWindow timeWindow, Iterable<HREvent> iterable, Collector<Object> collector) throws Exception {
for(HREvent in : iterable){
System.out.println("Patient id = " + in.getPatient_id() + " Heart Rate = " + in.getHR());
}//for
}//apply
});
Hope it helps!
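
A lighter-weight alternative (a sketch reusing the TemperatureEvent getter from the question): access the individual fields with a simple map, so that only what you select gets printed, no windowing required:

// Extract just the temperature from each warning, then print the strings
warnings
        .map(evt -> "temperature = " + evt.getTemperature())
        .print();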

How to avoid repeated tuples in Flink slide window join?

For example, there are two streams. One is the advertisements shown to users; its tuples can be described as (advertiseId, shown timestamp). The other is the click stream: (advertiseId, clicked timestamp). We want to get a joined stream which includes every advertisement that is clicked by a user within 20 minutes after being shown. My solution is to join these two streams on a SlidingTimeWindow, but the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
.where(0)
.equalTo(0)
.window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Let Flink support joining two streams on separate windows, like Spark Streaming does. In this case, implement SlidingTimeWindows(21 mins, 1 min) on the advertisement stream and TumblingTimeWindows(1 min) on the click stream, then join the two windowed streams.
The TumblingTimeWindows avoid duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoid missing legitimate clicks.
One issue is that there would be some illegitimate clicks (clicks after 20 minutes) in the joined stream. This can easily be fixed by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new MultiWindowsJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
.window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
.equalTo(keySelector)
.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -3625150954096822268L;
@Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -4325256210808325338L;
@Override
public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
return value.f1 < value.f2 && value.f1 + 20000 >= value.f2;
}
});
Solution 2:
Flink supports join operations without windows. A join operator implementing the interface TwoInputStreamOperator keeps two time-length-based buffers of the two streams and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
.addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
@Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor1);
DataStream<Tuple2<String, Long>> click = env
.addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
@Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor2);
NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new NoWindowJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
.where(keySelector)
.buffer(Time.of(20, TimeUnit.SECONDS))
.equalTo(keySelector)
.buffer(Time.of(5, TimeUnit.SECONDS))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -5075871109025215769L;
@Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
I implemented two new join operators based on Flink's streaming API TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by specifying only the window size (the default slide is equal to the window size); see the sketch below.
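Applied to the join from the question, a sketch (not tested) could be:

stream1.join(stream2)
        .where(0)
        .equalTo(0)
        // tumbling window: slide == size, so each matching pair is emitted only once
        .window(TumblingTimeWindows.of(Time.of(20, TimeUnit.MINUTES)))
        .apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
            @Override
            public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) {
                return new Tuple3<>(first.f0, first.f1, second.f1);
            }
        });

Note the trade-off: with a 20-minute tumbling window, a show at the end of one window and its click at the start of the next would not be joined, since only pairs within the same window match.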
While searching for a solution to the same problem, I found the "Interval Join" very useful; it does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
.keyBy(<KeySelector>)
.intervalJoin(greenStream.keyBy(<KeySelector>))
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process(new ProcessJoinFunction<Integer, Integer, String>() {
@Override
public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
out.collect(left + "," + right);
}
});
With this, no explicit window has to be defined; instead, an interval is applied to each individual element, as illustrated by the interval join diagram in the Flink documentation.
