I'm using Apache Flink 1.11 and want to use a custom WatermarkGenerator.
With the WatermarkStrategy, you can add built-in WatermarkGenerators with ease:
WatermarkStrategy.forMonotonousTimestamps();
WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(10));
In the documentation, you can see how to implement a custom WatermarkGenerator, for example a periodic WatermarkGenerator:
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<MyEvent> {
private final long maxOutOfOrderness = 3500; // 3.5 seconds
private long currentMaxTimestamp;
@Override
public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// emit the watermark as current highest timestamp minus the out-of-orderness bound
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
How can I add this custom BoundedOutOfOrdernessGenerator to a WatermarkStrategy?
A WatermarkStrategy is the thing you need to define. So assuming you have some class MyWatermarkGenerator that implements WatermarkGenerator<MyEvent>, then you'd do something like:
WatermarkStrategy<MyEvent> ws = (ctx -> new MyWatermarkGenerator());
...
DataStream<MyEvent> ds = xxx;
ds.assignTimestampsAndWatermarks(ws);
Note that unless your source is setting up timestamps for you (e.g. Kafka record timestamps), you'll want to add a timestamp extractor to your WatermarkStrategy, as in...
WatermarkStrategy<MyEvent> ws = (ctx -> new MyWatermarkGenerator());
ws = ws.withTimestampAssigner((r, ts) -> r.getTimestamp());
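For the custom BoundedOutOfOrdernessGenerator from the question, the wiring could look like this (a sketch; the getTimestamp() accessor on MyEvent is an assumption):
WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
        .<MyEvent>forGenerator(ctx -> new BoundedOutOfOrdernessGenerator())
        // assumption: MyEvent exposes its event time via getTimestamp()
        .withTimestampAssigner((event, ts) -> event.getTimestamp());

DataStream<MyEvent> withWatermarks = ds.assignTimestampsAndWatermarks(strategy);
WatermarkStrategy.forGenerator(...) does the same thing as the lambda above, just a bit more explicitly.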
Related
I am currently trying to upgrade a method call assignTimestampsAndWatermarks that is applied to a data stream. The data stream looks something like this:
DataStream<Auction> auctions = env.addSource(new AuctionSourceFunction(auctionSrcRates))
.name("Custom Source")
.setParallelism(params.getInt("p-auction-source", 1))
.assignTimestampsAndWatermarks(new AuctionTimestampAssigner());
The AssignerWithPeriodicWatermarks looks like this:
private static final class AuctionTimestampAssigner implements AssignerWithPeriodicWatermarks<Auction> {
private long maxTimestamp = Long.MIN_VALUE;
@Nullable
@Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTimestamp);
}
@Override
public long extractTimestamp(Auction element, long previousElementTimestamp) {
maxTimestamp = Math.max(maxTimestamp, element.dateTime);
return element.dateTime;
}
}
What are the steps I would need to take to upgrade from deprecated calls to the current best practices? Thanks.
Your watermark generator assumes that the events are in order, by timestamp, or at least accepts that any out-of-order events will be late. This is equivalent to
assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<Auction>forMonotonousTimestamps()
        .withTimestampAssigner((event, timestamp) -> event.dateTime));
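Putting it together with the stream definition from the question, the upgraded source would look roughly like this (a sketch; only the watermarking call changes):
DataStream<Auction> auctions = env.addSource(new AuctionSourceFunction(auctionSrcRates))
        .name("Custom Source")
        .setParallelism(params.getInt("p-auction-source", 1))
        .assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Auction>forMonotonousTimestamps()
                        .withTimestampAssigner((event, timestamp) -> event.dateTime));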
I observed what appears to be a change in behavior for EventTimeSessionWindows when upgrading from 1.11.1 to 1.14.0. This was identified in a unit test.
For a session window with a defined time gap of 10 seconds:
Publish KEY_1 with eventtime 1 second
Publish KEY_1 with eventtime 3 seconds
Publish KEY_1 with eventtime 2 seconds
Publish KEY_2 with eventtime 101 seconds
For Flink 1.11.1, the window for KEY_1 closes, reduces, and publishes, presumably because KEY_2 has an event time more than 10 seconds after the last message in KEY_1's window. The KEY_2 window does not close. In the absence of KEY_2, the KEY_1 window would not close either.
For Flink 1.14.0 the main difference is that the window for KEY_2 DOES close even though there are no new messages after 111 seconds.
This appears to be a change in behavior. The nearest I could find was https://issues.apache.org/jira/browse/FLINK-20443 but that’s in 1.14.1. I also noticed https://issues.apache.org/jira/browse/FLINK-19777 which was in 1.11.3 but couldn't ascertain if that would have resulted in this behavior. Is there an explanation for this change in behavior? Is it expected or desirable? Is it because all pending windows are automatically closed based on an updated trigger behavior?
I tested the same behavior for ProcessingTimeSessionWindows and did not observe a similar change in behavior.
Thanks.
Jai
@Test
public void testEventTime() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// configure your test environment
env.setParallelism(1);
env.getConfig().registerTypeWithKryoSerializer(Document.class, ProtobufSerializer.class);
// values are collected in a static variable
CollectSink.values.clear();
// create a stream of custom elements and apply transformations
SingleOutputStreamOperator<Tuple2<String, Document>> inputStream = buildStream(env, this.generateTestOrders());
SingleOutputStreamOperator<Tuple2<String, Document>> intermediateStream = this.documentDebounceFunction.insertIntoPipeline(inputStream);
intermediateStream.addSink(new CollectSink());
// execute
env.execute();
// verify your results
Assertions.assertEquals(1, CollectSink.values.size());
Map<String, Long> expectedVersions = Maps.newHashMap();
expectedVersions.put(KEY_1, 2L);
for (Tuple2<String, Document> actual : CollectSink.values) {
Assertions.assertEquals(expectedVersions.get(actual.f0), actual.f1.getVersion());
}
}
// create a testing sink
private static class CollectSink implements SinkFunction<Tuple2<String, Document>> {
// must be static
public static final List<Tuple2<String, Document>> values = Collections.synchronizedList(new ArrayList<>());
@Override
public void invoke(Tuple2<String, Document> value, SinkFunction.Context context) {
values.add(value);
}
}
public List<Tuple2<String, Document>> generateTestOrders() {
List<Tuple2<String, Document>> testMessages = Lists.newArrayList();
// KEY_1
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(1)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(1).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(2)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(3).build())
.build()));
testMessages.add(
Tuple2.of(
KEY_1,
Document.newBuilder()
.setVersion(3)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(2).build())
.build()));
// KEY_2 -- WAY IN THE FUTURE
testMessages.add(
Tuple2.of(
KEY_2,
Document.newBuilder()
.setVersion(15)
.setUpdatedAt(
Timestamp.newBuilder().setSeconds(101).build())
.build()));
return ImmutableList.copyOf(testMessages);
}
private SingleOutputStreamOperator<Tuple2<String, Document>> buildStream(
StreamExecutionEnvironment executionEnvironment,
List<Tuple2<String, Document>> inputMessages) {
inputMessages =
inputMessages.stream()
.sorted(
Comparator.comparingInt(
msg -> (int) ProtobufTypeConversion.toMillis(msg.f1.getUpdatedAt())))
.collect(Collectors.toList());
WatermarkStrategy<Tuple2<String, Document>> watermarkStrategy =
WatermarkStrategy.forMonotonousTimestamps();
return executionEnvironment
.fromCollection(
inputMessages, TypeInformation.of(new TypeHint<Tuple2<String, Document>>() {}))
.assignTimestampsAndWatermarks(
watermarkStrategy.withTimestampAssigner(
(event, timestamp) -> Timestamps.toMillis(event.f1.getUpdatedAt())));
}
I am new to Flink and doing something very similar to the below link.
Cannot see message while sinking kafka stream and cannot see print message in flink 1.2
I am also trying to add JSONDeserializationSchema() as the deserializer for my Kafka input JSON messages, which have no key.
But I found that JSONDeserializationSchema() is not present.
Please let me know if I am doing anything wrong.
JSONDeserializationSchema was removed in Flink 1.8, after having been deprecated earlier.
The recommended approach is to write a deserializer that implements DeserializationSchema<T>. Here's an example, which I've copied from the Flink Operations Playground:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
/**
* A Kafka {@link DeserializationSchema} to deserialize {@link ClickEvent}s from JSON.
*
*/
public class ClickEventDeserializationSchema implements DeserializationSchema<ClickEvent> {
private static final long serialVersionUID = 1L;
private static final ObjectMapper objectMapper = new ObjectMapper();
@Override
public ClickEvent deserialize(byte[] message) throws IOException {
return objectMapper.readValue(message, ClickEvent.class);
}
@Override
public boolean isEndOfStream(ClickEvent nextElement) {
return false;
}
@Override
public TypeInformation<ClickEvent> getProducedType() {
return TypeInformation.of(ClickEvent.class);
}
}
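Wiring the deserializer into a consumer is then straightforward (a sketch; the topic name and connection properties are assumptions):
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // assumption: your broker address
props.setProperty("group.id", "click-event-consumer");    // assumption: your consumer group

DataStream<ClickEvent> clicks = env.addSource(
        new FlinkKafkaConsumer<>("input", new ClickEventDeserializationSchema(), props));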
For a Kafka producer you'll want to implement KafkaSerializationSchema<T>, and you'll find examples of that in that same project.
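A matching serializer might look roughly like this (a sketch modeled on the deserializer above; the class name is hypothetical and error handling is kept minimal):
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.streaming.connectors.kafka.KafkaSerializationSchema;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickEventSerializationSchema implements KafkaSerializationSchema<ClickEvent> {
    private static final ObjectMapper objectMapper = new ObjectMapper();
    private final String topic;

    public ClickEventSerializationSchema(String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(ClickEvent element, Long timestamp) {
        try {
            // write the event as JSON into the configured topic
            return new ProducerRecord<>(topic, objectMapper.writeValueAsBytes(element));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not serialize record: " + element, e);
        }
    }
}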
To solve the problem of reading non-keyed JSON messages from Kafka, I used a case class and a JSON parser.
The following code defines a case class and parses the JSON fields using the Play API.
import play.api.libs.json.JsValue
object CustomerModel {
def readElement(jsonElement: JsValue): Customer = {
val id = (jsonElement \ "id").get.toString().toInt
val name = (jsonElement \ "name").get.toString()
Customer(id,name)
}
case class Customer(id: Int, name: String)
}
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "xxx.xxx.0.114:9092")
properties.setProperty("group.id", "test-grp")
val consumer = new FlinkKafkaConsumer[String]("customer", new SimpleStringSchema(), properties)
val stream1 = env.addSource(consumer).rebalance
val stream2: DataStream[Customer] = stream1.map(str => {
  Try(CustomerModel.readElement(Json.parse(str)))
    .getOrElse(Customer(0, Try(CustomerModel.readElement(Json.parse(str))).toString))
})
stream2.print("stream2")
env.execute("This is Kafka+Flink")
}
The Try method lets you handle the exception thrown while parsing the data. It either returns the exception in one of the fields (if we want) or simply returns the case class object with the given or default field values.
The sample output of the Code is:
stream2:1> Customer(1,"Thanh")
stream2:1> Customer(5,"Huy")
stream2:3> Customer(0,Failure(com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]))
I am not sure if it is the best approach but it is working for me as of now.
I'm just starting with Flink CEP and I come from the Esper CEP engine. As you may (or may not) know, in Esper you can use its syntax (EPL) to create a batch or sliding window easily, grouping the events in those windows and applying aggregate functions (avg, max, min, ...) to them.
For example, with the following pattern you can create a batch window of 5 seconds and calculate the average value of the price attribute of all the Stock events received in that window.
select avg(price) from Stock#time_batch(5 sec)
The thing is, I would like to know how to implement this in Flink CEP. I'm aware that the goal or approach in Flink CEP is probably different, so the way to implement this may not be as simple as in Esper CEP.
I have taken a look at the docs regarding time windows, but I'm not able to combine these windows with Flink CEP. So, given the following code:
DataStream<Stock> stream = ...; // Consume events from Kafka
// Filtering events with negative price
Pattern<Stock, ?> pattern = Pattern.<Stock>begin("start")
.where(
new SimpleCondition<Stock>() {
public boolean filter(Stock event) {
return event.getPrice() >= 0;
}
}
);
PatternStream<Stock> patternStream = CEP.pattern(stream, pattern);
/**
CREATE A BATCH WINDOW OF 5 SECONDS IN WHICH
I COMPUTE THE AVERAGE OF THE PRICES AND, IF IT IS
GREATER THAN A THRESHOLD, AN ALERT IS DETECTED
return avg(allEventsInWindow.getPrice()) > 1;
*/
DataStream<Alert> result = patternStream.select(
new PatternSelectFunction<Stock, Alert>() {
@Override
public Alert select(Map<String, List<Stock>> pattern) throws Exception {
return new Alert(pattern.toString());
}
}
);
How can I create that window in which, starting from the first event received, I calculate the average of the events arriving within the next 5 seconds? For example:
t = 0 seconds
Stock(price = 1); (...starting batch window...)
Stock(price = 1);
Stock(price = 1);
Stock(price = 2);
Stock(price = 2);
Stock(price = 2);
t = 5 seconds (...end of batch window...)
Avg = 1.5 => Alert detected!
The average after 5 seconds would be 1.5, which would trigger the alert. How can I code this?
Thanks!
With Flink's CEP library this behavior is not expressible. I would rather recommend using Flink's DataStream or Table API to calculate the averages. Based on that you could again use CEP to generate other events.
final DataStream<Stock> input = env
.fromElements(
new Stock(1L, 1.0),
new Stock(2L, 2.0),
new Stock(3L, 1.0),
new Stock(4L, 2.0))
.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Stock>(Time.seconds(0L)) {
@Override
public long extractTimestamp(Stock element) {
return element.getTimestamp();
}
});
final DataStream<Double> windowAggregation = input
.timeWindowAll(Time.milliseconds(2))
.aggregate(new AggregateFunction<Stock, Tuple2<Integer, Double>, Double>() {
@Override
public Tuple2<Integer, Double> createAccumulator() {
return Tuple2.of(0, 0.0);
}
@Override
public Tuple2<Integer, Double> add(Stock value, Tuple2<Integer, Double> accumulator) {
return Tuple2.of(accumulator.f0 + 1, accumulator.f1 + value.getValue());
}
@Override
public Double getResult(Tuple2<Integer, Double> accumulator) {
return accumulator.f1 / accumulator.f0;
}
@Override
public Tuple2<Integer, Double> merge(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
}
});
final DataStream<Double> result = windowAggregation.filter((FilterFunction<Double>) value -> value > THRESHOLD);
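The snippet above assumes a simple Stock POJO along these lines (a sketch; the field names are inferred from the calls element.getTimestamp() and value.getValue()):
public class Stock {
    private long timestamp;
    private double value;

    // Flink treats this as a POJO only if a no-arg constructor and setters exist
    public Stock() {}

    public Stock(long timestamp, double value) {
        this.timestamp = timestamp;
        this.value = value;
    }

    public long getTimestamp() { return timestamp; }
    public double getValue() { return value; }

    public void setTimestamp(long timestamp) { this.timestamp = timestamp; }
    public void setValue(double value) { this.value = value; }
}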
For example, there are two streams. One is the advertisements shown to users; its tuples can be described as (advertiseId, showed timestamp). The other one is the click stream -- (advertiseId, clicked timestamp). We want to get a joined stream that includes all the advertisements clicked by a user within 20 minutes after being shown. My solution is to join these two streams on a SlidingTimeWindow. But the joined stream contains many repeated tuples. How can I get each joined tuple only once in the new stream?
stream1.join(stream2)
.where(0)
.equalTo(0)
.window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Have Flink join the two streams on separate windows, as Spark Streaming does. In this case, apply SlidingTimeWindows(21 mins, 1 min) to the advertisement stream and TumblingTimeWindows(1 min) to the click stream, then join these two windowed streams.
TumblingTimeWindows avoids duplicate records in the joined stream.
The 21-minute SlidingTimeWindows avoids missing legitimate clicks.
One issue is that some illegitimate clicks (clicks after 20 minutes) would appear in the joined stream. This can easily be fixed by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new MultiWindowsJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
.window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
.equalTo(keySelector)
.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -3625150954096822268L;
@Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -4325256210808325338L;
@Override
public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
return value.f1<value.f2&&value.f1+20000>=value.f2;
}
});
Solution 2:
Have Flink support a join operation without windows. A join operator implementing the interface TwoInputStreamOperator keeps two time-based buffers of these two streams and outputs one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
.addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
@Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor1);
DataStream<Tuple2<String, Long>> click = env
.addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
@Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor2);
NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new NoWindowJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
.where(keySelector)
.buffer(Time.of(20, TimeUnit.SECONDS))
.equalTo(keySelector)
.buffer(Time.of(5, TimeUnit.SECONDS))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -5075871109025215769L;
@Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
I implemented two new join operators based on the Flink streaming API TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
In your code, you defined an overlapping sliding window (the slide is smaller than the window size). If you don't want duplicates, you can define a non-overlapping window by specifying only the window size (the default slide equals the window size).
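For example, against the snippet in your question it would be something like this (a sketch; TumblingTimeWindows is the tumbling assigner from the same API generation as SlidingTimeWindows, called TumblingEventTimeWindows in later versions, and the rest of the join stays as in your code):
stream1.join(stream2)
    .where(0)
    .equalTo(0)
    .window(TumblingTimeWindows.of(Time.of(30, TimeUnit.MINUTES)))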
While searching for a solution for the same problem, I found the "Interval Join" very useful, which does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
.keyBy(<KeySelector>)
.intervalJoin(greenStream.keyBy(<KeySelector>))
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process(new ProcessJoinFunction<Integer, Integer, String>() {
@Override
public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
out.collect(left + "," + right);
}
});
With this, no explicit window has to be defined; instead, an interval relative to each single element is used, as illustrated in the Flink documentation.
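Applied to the advertisement/click example from the question, an interval join might look roughly like this (a sketch; it assumes both streams are Tuple2<String, Long> of (advertiseId, timestamp) as in the earlier answer, and that clicks count if they arrive within 20 minutes after the ad is shown):
advertisement
    .keyBy(ad -> ad.f0)
    .intervalJoin(click.keyBy(c -> c.f0))
    // a click joins an ad shown between 0 and 20 minutes earlier
    .between(Time.minutes(0), Time.minutes(20))
    .process(new ProcessJoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
        @Override
        public void processElement(Tuple2<String, Long> ad, Tuple2<String, Long> clicked, Context ctx, Collector<Tuple3<String, Long, Long>> out) {
            out.collect(Tuple3.of(ad.f0, ad.f1, clicked.f1));
        }
    });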