Upgrading Flink deprecated function calls - apache-flink

I am currently trying to upgrade a method call assignTimestampsAndWatermarks that is applied to a data stream. The data stream looks something like this:
DataStream<Auction> auctions = env.addSource(new AuctionSourceFunction(auctionSrcRates))
.name("Custom Source")
.setParallelism(params.getInt("p-auction-source", 1))
.assignTimestampsAndWatermarks(new AuctionTimestampAssigner());
The AssignerWithPeriodicWatermark looks like this:
private static final class AuctionTimestampAssigner implements AssignerWithPeriodicWatermarks<Auction> {
private long maxTimestamp = Long.MIN_VALUE;
#Nullable
#Override
public Watermark getCurrentWatermark() {
return new Watermark(maxTimestamp);
}
#Override
public long extractTimestamp(Auction element, long previousElementTimestamp) {
maxTimestamp = Math.max(maxTimestamp, element.dateTime);
return element.dateTime;
}
}
What are the steps I would need to take to upgrade from deprecated calls to the current best practices? Thanks.

Your watermark generator assumes that the events are in order, by timestamp, or at least accepts that any out-of-order events will be late. This is equivalent to
assignTimestampsAndWatermarks(
WatermarkStrategy
.<Auction>forMonotonousTimestamps()
.withTimestampAssigner((event, timestamp) -> event.dateTime))

Related

Huge checkpoint size using ValueState leading to event processing lag

I have an application in flink, which does deduplication of multiple streams.
It does key-by on one string field and dedupes it by using value-state.
Using value state in RichFilterFunction.
public class DedupeWithState extends RichFilterFunction<Tuple2<String, Message>> {
private ValueState<Boolean> seen;
private final ValueStateDescriptor<Boolean> desc;
public DedupeWithState(long cacheExpirationTimeMs) {
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.milliseconds(cacheExpirationTimeMs))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
desc.enableTimeToLive(ttlConfig);
}
#Override
public void open(Configuration conf) {
seen = getRuntimeContext().getState(desc);
}
#Override
public boolean filter(Tuple2<String, Message> stringMessageTuple2) throws Exception {
if (seen.value() == null) {
seen.update(true);
return true;
}
return false;
}
}
The application consumes 3 streams from kafka, and each stream has its own dedupe function with ttl of 4hours.
DataStream<Tuple2<String, Message>> event1 = event1Input
.keyBy(x->x.f0)
.filter(new DedupeWithState(14400000));
DataStream<Tuple2<String, Message>> event2 = event2Input
.keyBy(x->x.f0)
.filter(new DedupeWithState(14400000));
DataStream<Tuple2<String, Message>> event3 = event3Input
.keyBy(x->x.f0)
.filter(new DedupeWithState(14400000));
Screenshots attached.
Backend properties are:
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: <azure blob store>
Checkpoint configuration as on WebUI
We are using Flink 1.13.6.
The QPS of each stream is event1 - 7k, event2 - 6k, event3 - 200
Key size is ~110 bytes
Checkpoint interval is 5 mins and incremental checkpoint is enabled.
As per above configs (given that incremental checkpoint is enabled) each stream should have following checkpoint size:
event1 -> ((7000 * 60 * 5) * 110bytes) = ~220MB
Issue is the checkpoint size is very huge. It starts from 400 MB (as expected) but is going upto 2-3GB per checkpoint Checkpoint history. This results in huge backpressure in Dedupe function and overall lag in the system. Checkpoint per operator
Maybe the state is not being cleaned since it is done lazily (on read). From the initial release post (a bit old but may still stand):
When a state object is accessed in a read operation, Flink will check its timestamp and clear the state if it is expired (depending on the configured state visibility, the expired state is returned or not). Due to this lazy removal, expired state that is never accessed again will forever occupy storage space unless it is garbage collected.
I would try with a MapState per stream (without keying by) instead of a ValueState per key as you have now so the same state is continuously accessed. Or you may also be able to set up a timer in DedupeWithState that accesses the state and forces the cleanup (you may need to use a ProcessFunction to be able to set up timers) or that simply clears it.
Try something like this -
/**
* #author sucheth.shivakumar
*/
public class Check extends KeyedProcessFunction {
private ValueState<Boolean> seen;
#Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
// defines the time the state has to be stored in the state backend before it is auto cleared
seen = getRuntimeContext().getState(desc);
}
#Override
public void processElement(Object value, Context ctx, Collector out) throws Exception {
if (seen.value() == null) {
seen.update(true);
// emits the record
out.collect(stringMessageTuple2);
ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + 14400000);
}
}
#Override
// this fires after 4 hrs is passed and clears the state
public void onTimer(long timestamp, OnTimerContext ctx, Collector out)
throws Exception {
// triggers after ttl has passed
if (seen.value()) {
seen.clear();
}
}

Absence of event in Apache Flink CEP

I'm new at Apache Flink CEP and I'm struggle trying to detect a simple absence of event.
What I'm trying to detect is wheter an event of type CurrencyEvent with a certain id does not occur in certain amount of time. I would like to detect the absence of such event every time that after 3000ms the event does not occur.
My pattern code looks as follows:
Pattern<CurrencyEvent, ?> myPattern = Pattern.<Event>begin("CurrencyEvent")
.subtype(CurrencyEvent.class)
.where(new SimpleCondition<CurrencyEvent>() {
#Override
public boolean filter(CurrencyEvent currencyEvent) throws Exception {
return currencyEvent.getId().equalsIgnoreCase("usd");
}
})
.within(Time.milliseconds(3000L));
So now my idea is to use timeout functions in order to detect timeout events:
DataStreamSource<Event> events = env.addSource(new TestSource(
Arrays.asList(
basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D),
basicCurrencyWithMivLevelEvent("USD", 100L, Arrays.asList("1", "2"), 200D),
basicCurrencyWithMivLevelEvent("EUR", 100L, Arrays.asList("1", "2"), 200D)
),
1636040364820L, // initial timestamp for the first element
7000 // 7 seconds between each event
));
PatternStream<Event> patternStream = CEP.pattern(
events,
(Pattern<Event, ?>) myPattern
);
OutputTag<Alarm> tag = new OutputTag<Alarm>("currency-timeout"){};
PatternFlatTimeoutFunction<Event, Alarm> eventAlarmTimeoutPatternFunction = (patterns, timestamp, ctx) -> {
System.out.println("New alarm, since after 3 seconds an event with id=usd is not detected");
//TODO: call collect
};
PatternFlatSelectFunction<Event, Alarm> eventAlarmPatternSelectFunction = (patterns, ctx) -> {
System.out.println("Select! (we can ignore it) " + patterns);
// ignore matched events
};
return patternStream.flatSelect(
tag,
eventAlarmTimeoutPatternFunction,
TypeInformation.of(Alarm.class),
eventAlarmPatternSelectFunction
);
My Test source is using event timestamps and watermarks, as shown as follows:
public class TestSource implements SourceFunction<Event> {
private final List<Event> events;
private final long initialTimestamp;
private final long timeBetweenInMillis;
public TestSource(List<Event> events, long initialTimestamp, long timeBetweenInMillis){
this.events = events;
this.initialTimestamp = initialTimestamp;
this.timeBetweenInMillis = timeBetweenInMillis;
}
#Override
public void run(SourceContext<Event> sourceContext) throws InterruptedException {
long timestamp = this.initialTimestamp;
for(Event event: this.events){
sourceContext.collectWithTimestamp(event, timestamp);
sourceContext.emitWatermark(new Watermark(timestamp));
timestamp+=this.timeBetweenInMillis;
}
}
#Override
public void cancel() {
}
}
I'm using TimeCharacteristics.EventTime.
Since the the window time (3seconds) is lower than the event time difference between every event (7 seconds), I expect to get some timeout events, but I'm getting 0.
A CEP Pattern matches a sequence of one or more events; the within(interval) clause adds an additional constraint that all of the events in the sequence must occur within the specified interval. When partial matches time out, this can be captured in a TimedOutPartialMatchHandler.
In your case, since a successfully matched Pattern consists of a single event, there can be no partial matches, and a match can never time out. (Your matching sequences are always less than 3 seconds long.)
What you can do is to extend the pattern definition to include a second event, so that to match there must be a start event followed by another event within 3 seconds. When that second event is missing, then you will have a partial match that times out.
For more flexibility than what CEP offers for implementing use cases involving missing events, you can use a KeyedProcessFunction with timers.

How to add a custom WatermarkGenerator to a WatermarkStrategy

I'm using Apache Flink 1.11 and want to use some custom WatermarkGenerator.
With the Watermarkstrategy, you can add built-in WatermarkGenerators with ease:
WatermarkStrategy.forMonotonousTimestamps();
WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(10));
In the documentation, you can see how to implement a custom Watermarkgenerator, for example a Periodic WatermarkGenerator:
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<MyEvent> {
private final long maxOutOfOrderness = 3500; // 3.5 seconds
private long currentMaxTimestamp;
#Override
public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
#Override
public void onPeriodicEmit(WatermarkOutput output) {
// emit the watermark as current highest timestamp minus the out-of-orderness bound
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
How can i add this custom BoundedOutOfOrdernessGenerator to a Watermarkstrategy?
A WatermarkStrategy is the thing you need to define. So assuming you have some class MyWatermarkGenerator that implements WatermarkGenerator<MyEvent>, then you'd do something like:
WatermarkStrategy<WatermarkedRecord> ws = (ctx -> new MyWatermarkGenerator());
...
DataStream<MyEvent> ds = xxx;
ds.assignTimestampsAndWatermarks(ws);
Note that unless your source is setting up timestamps for you (e.g. Kafka record timestamps), you'll want to add a timestamp extractor to your WatermarkStrategy, as in...
WatermarkStrategy<WatermarkedRecord> ws = (ctx -> new MyWatermarkGenerator());
ws = ws.withTimestampAssigner((r, ts) -> r.getTimestamp());

is JSONDeserializationSchema() deprecated in Flink?

I am new to Flink and doing something very similar to the below link.
Cannot see message while sinking kafka stream and cannot see print message in flink 1.2
I am also trying to add JSONDeserializationSchema() as a deserializer for my Kafka input JSON message which is without a key.
But I found JSONDeserializationSchema() is not present.
Please let me know if I am doing anything wrong.
JSONDeserializationSchema was removed in Flink 1.8, after having been deprecated earlier.
The recommended approach is to write a deserializer that implements DeserializationSchema<T>. Here's an example, which I've copied from the Flink Operations Playground:
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
/**
* A Kafka {#link DeserializationSchema} to deserialize {#link ClickEvent}s from JSON.
*
*/
public class ClickEventDeserializationSchema implements DeserializationSchema<ClickEvent> {
private static final long serialVersionUID = 1L;
private static final ObjectMapper objectMapper = new ObjectMapper();
#Override
public ClickEvent deserialize(byte[] message) throws IOException {
return objectMapper.readValue(message, ClickEvent.class);
}
#Override
public boolean isEndOfStream(ClickEvent nextElement) {
return false;
}
#Override
public TypeInformation<ClickEvent> getProducedType() {
return TypeInformation.of(ClickEvent.class);
}
}
For a Kafka producer you'll want to implement KafkaSerializationSchema<T>, and you'll find examples of that in that same project.
To solve the problem of reading non-key JSON messages from Kafka I used case class and JSON parser.
The following code makes a case class and parses the JSON field using play API.
import play.api.libs.json.JsValue
object CustomerModel {
def readElement(jsonElement: JsValue): Customer = {
val id = (jsonElement \ "id").get.toString().toInt
val name = (jsonElement \ "name").get.toString()
Customer(id,name)
}
case class Customer(id: Int, name: String)
}
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
val properties = new Properties()
properties.setProperty("bootstrap.servers", "xxx.xxx.0.114:9092")
properties.setProperty("group.id", "test-grp")
val consumer = new FlinkKafkaConsumer[String]("customer", new SimpleStringSchema(), properties)
val stream1 = env.addSource(consumer).rebalance
val stream2:DataStream[Customer]= stream1.map( str =>{Try(CustomerModel.readElement(Json.parse(str))).getOrElse(Customer(0,Try(CustomerModel.readElement(Json.parse(str))).toString))
})
stream2.print("stream2")
env.execute("This is Kafka+Flink")
}
The Try method lets you overcome the exception thrown while parsing the data
and returns the exception in one of the fields (if we want) or else it can just return the case class object with any given or default fields.
The sample output of the Code is:
stream2:1> Customer(1,"Thanh")
stream2:1> Customer(5,"Huy")
stream2:3> Customer(0,Failure(com.fasterxml.jackson.databind.JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 0]))
I am not sure if it is the best approach but it is working for me as of now.

How to avoid repeated tuples in Flink slide window join?

For example, there are two streams. One is advertisements showed to users. The tuple in which could be described as (advertiseId, showed timestamp). The other one is click stream -- (advertiseId, clicked timestamp). We want get a joined stream, which includes all the advertisement that is clicked by user in 20 minutes after showed. My solution is to join these two streams on a SlidingTimeWindow. But in the joined stream, there are many repeated tuples. How could I get joined tuple only one in new stream?
stream1.join(stream2)
.where(0)
.equalTo(0)
.window(SlidingTimeWindows.of(Time.of(30, TimeUnit.MINUTES), Time.of(10, TimeUnit.SECONDS)))
Solution 1:
Let flink support join two streams on separate windows like Spark streaming. In this case, implement SlidingTimeWindows(21 mins, 1 min) on advertisement stream and TupblingTimeWindows(1 min) on Click stream, then join these two windowed streams.
TupblingTimeWindows could avoid duplicate records in the joined stream.
21 mins size SlidingTimeWindows could avoid missing legal clicks.
One issue is there would be some illegal click(click after 20 mins) in the joined stream. This problem could be fixed easily by adding a filter.
MultiWindowsJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new MultiWindowsJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams.where(keySelector)
.window(SlidingTimeWindows.of(Time.of(21, TimeUnit.SECONDS), Time.of(1, TimeUnit.SECONDS)))
.equalTo(keySelector)
.window(TumblingTimeWindows.of(Time.of(1, TimeUnit.SECONDS)))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -3625150954096822268L;
#Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
joinedStream = joinedStream.filter(new FilterFunction<Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -4325256210808325338L;
#Override
public boolean filter(Tuple3<String, Long, Long> value) throws Exception {
return value.f1<value.f2&&value.f1+20000>=value.f2;
}
});
Solution 2:
Flink supports join operation without window. A join operator implement the interface TwoInputStreamOperator keeps two buffers(time length based) of these two streams and output one joined stream.
DataStream<Tuple2<String, Long>> advertisement = env
.addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
#Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor1);
DataStream<Tuple2<String, Long>> click = env
.addSource(new FlinkKafkaConsumer082<String>("click", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
#Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).keyBy(keySelector).assignTimestamps(timestampExtractor2);
NoWindowJoinedStreams<Tuple2<String, Long>, Tuple2<String, Long>> joinedStreams =
new NoWindowJoinedStreams<>(advertisement, click);
DataStream<Tuple3<String, Long, Long>> joinedStream = joinedStreams
.where(keySelector)
.buffer(Time.of(20, TimeUnit.SECONDS))
.equalTo(keySelector)
.buffer(Time.of(5, TimeUnit.SECONDS))
.apply(new JoinFunction<Tuple2<String, Long>, Tuple2<String, Long>, Tuple3<String, Long, Long>>() {
private static final long serialVersionUID = -5075871109025215769L;
#Override
public Tuple3<String, Long, Long> join(Tuple2<String, Long> first, Tuple2<String, Long> second) throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
});
I implemented two new join operators base on Flink streaming API TwoInputTransformation. Please check Flink-stream-join. I will add more tests to this repository.
On your code, you defined an overlapping sliding window (slide is smaller than window size). If you don't want to have duplicates you can define a non-overlapping window by only specifying the window size (the default slide is equal to the window size).
While searching for a solution for the same problem, I found the "Interval Join" very useful, which does not repeatedly output the same elements. This is the example from the Flink documentation:
DataStream<Integer> orangeStream = ...
DataStream<Integer> greenStream = ...
orangeStream
.keyBy(<KeySelector>)
.intervalJoin(greenStream.keyBy(<KeySelector>))
.between(Time.milliseconds(-2), Time.milliseconds(1))
.process (new ProcessJoinFunction<Integer, Integer, String(){
#Override
public void processElement(Integer left, Integer right, Context ctx, Collector<String> out) {
out.collect(first + "," + second);
}
});
With this no explicit window has to be defined, instead an interval that is used for each single element like this (image from Flink documentation):

Resources