Flink and Kinesis stream app for non-continuous data - apache-flink

We've built a Flink app to process data from a Kinesis stream. The execution flow of the app contains basic operations: filtering data based on registered types, assigning watermarks based on event timestamps, and map, process, and aggregate functions applied to 5-minute windows of data, as shown below:
final SingleOutputStreamOperator<Object> inputStream = env.addSource(consumer)
        .setParallelism(..)
        .filter(..)
        .assignTimestampsAndWatermarks(..);

// Processing flow
inputStream
        .map(..)
        .keyBy(..)
        .window(..)
        .sideOutputLateData(outputTag)
        .aggregate(aggregateFunction, processWindowFunction);

// store processed data to external storage
AsyncDataStream.unorderedWait(...);
Reference code for my watermark assigner:
@Override
public void onEvent(@NonNull final MetricSegment metricSegment,
                    final long eventTimestamp,
                    @NonNull final WatermarkOutput watermarkOutput) {
    if (eventTimestamp > eventMaxTimestamp) {
        currentMaxTimestamp = Instant.now().toEpochMilli();
    }
    eventMaxTimestamp = Math.max(eventMaxTimestamp, eventTimestamp);
}

@Override
public void onPeriodicEmit(@NonNull final WatermarkOutput watermarkOutput) {
    final Instant maxEventTimestamp = Instant.ofEpochMilli(eventMaxTimestamp);
    final Duration timeElapsed = Duration.between(Instant.ofEpochMilli(lastCurrentTimestamp), Instant.now());
    if (timeElapsed.getSeconds() >= emitWatermarkIntervalSec) {
        final long watermarkTimestamp = maxEventTimestamp.plus(1, ChronoUnit.MINUTES).toEpochMilli();
        watermarkOutput.emitWatermark(new Watermark(watermarkTimestamp));
    }
}
Now this app was performing well (latency on the order of a few seconds) until some time ago. Recently, however, there was a change in the upstream system, after which data gets published to the Kinesis stream only in bursts (for 2-3 hours every day). Since this change, we have seen a huge spike in the latency of our app (measured with a Flink gauge: we record a start time in the first filter method and then emit the metric in the async operator by computing the difference between the timestamp at that point and the start timestamp). Is there any issue with using Flink apps on a Kinesis stream for bursty traffic / a non-continuous stream of data?

Since the input stream is now idle for long periods of time, this is probably creating situations where the watermarks are held up. If this is the case, then I would expect to see a lot of variance in the latency, as it would (probably) only be the final windows for each burst whose results are delayed until the arrival of the next burst.
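If held-up watermarks are indeed the cause, one option worth considering (not part of the original question, and assuming Flink 1.11+ with the WatermarkStrategy API) is to declare the stream idle after some period without data. A minimal sketch, where MyWatermarkGenerator stands in for the custom generator shown above and getEventTimestamp() is a hypothetical accessor:
WatermarkStrategy<MetricSegment> strategy = WatermarkStrategy
        .<MetricSegment>forGenerator(ctx -> new MyWatermarkGenerator())
        .withTimestampAssigner((segment, recordTimestamp) -> segment.getEventTimestamp())
        .withIdleness(Duration.ofMinutes(1)); // mark the stream idle after 1 minute without data

final SingleOutputStreamOperator<MetricSegment> inputStream = env.addSource(consumer)
        .filter(segment -> isRegisteredType(segment)) // hypothetical filter predicate
        .assignTimestampsAndWatermarks(strategy);
Note that withIdleness only lets downstream operators ignore idle inputs while other inputs still carry data; if the entire stream goes quiet, the watermark still does not advance until new records arrive, so the final windows of a burst would still wait.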

Related

Execute flink sink after tumbling window

Source: Kinesis data stream
Sink: Elasticsearch
Both are AWS services.
I am also running my Flink job as an AWS Kinesis Data Analytics application.
I am facing an issue with Flink's windowing function. My job looks like this:
DataStream<TrackingData> input = ...; // input from kinesis stream
input.keyBy(e -> e.getArea())
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce(new MyReduceFunction(), new MyProcessWindowFunction())
        .addSink(<elasticsearch sink>);
private static class MyReduceFunction implements ReduceFunction<TrackingData> {

    @Override
    public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
        trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
        return trackingData;
    }
}

private static class MyProcessWindowFunction extends ProcessWindowFunction<TrackingData, TrackingData, String, TimeWindow> {

    @Override
    public void process(String key,
                        Context context,
                        Iterable<TrackingData> in,
                        Collector<TrackingData> out) {
        TrackingData trackingIn = in.iterator().next();
        Long videoDuration = 0L;
        for (TrackingData t : in) {
            videoDuration += t.getVideoDuration();
        }
        trackingIn.setVideoDuration(videoDuration);
        out.collect(trackingIn);
    }
}
Sample event:
{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5}
What I do here: I receive these events from the Kinesis stream in large volumes, and I want to sum videoDuration over every 10-second window and then store that single event in Elasticsearch.
In Kinesis there can be 10,000 events per second. I don't want to store all 10,000 events in Elasticsearch; I just want to store one event for every 10 seconds.
The issue is that when I send an event to this job, it quickly processes the event and sinks it directly into Elasticsearch. What I want instead is for the videoDuration to be accumulated over each 10-second window and, after the 10 seconds, only one event to be stored in Elasticsearch.
How can I achieve this?
I think you've misdiagnosed the problem.
The code you've written will produce one event from each 10-second-long window for each distinct key that has events during the window. MyProcessWindowFunction isn't having any effect: since the window results have been pre-aggregated, each Iterable will contain exactly one event.
I believe you want to do this instead:
input.keyBy(e -> e.getArea())
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce(new MyReduceFunction())
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce(new MyReduceFunction())
        .addSink(<elasticsearch sink>);
You could also just do
input.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce(new MyReduceFunction())
        .addSink(<elasticsearch sink>);
but the first version will be faster, since it will be able to compute the per-key window results in parallel before computing the global sum in the windowAll.
FWIW, the Table/SQL API is usually a better fit for this type of application, and should produce a more optimized pipeline than either of these.
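For reference, a rough sketch of what the SQL version might look like, assuming a recent Flink version and a hypothetical table named tracking registered over the Kinesis source with an area column, a videoDuration column, and a processing-time attribute proc_time:
TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
// ... register the "tracking" table over the Kinesis source here ...

// Sum videoDuration over 10-second tumbling processing-time windows
tableEnv.executeSql(
        "SELECT TUMBLE_START(proc_time, INTERVAL '10' SECOND) AS window_start, " +
        "       SUM(videoDuration) AS videoDuration " +
        "FROM tracking " +
        "GROUP BY TUMBLE(proc_time, INTERVAL '10' SECOND)");
The result could then be written to Elasticsearch via a table sink, leaving the planner free to optimize the aggregation.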

Checkpoints increasing over time in Flink

In addition to this question, I'm still not clear on why the checkpoints of my Flink job grow and grow over time; at the moment, after about 7 days running, these checkpoints have never reached a plateau.
I'm using Flink 1.10 at the moment, with the FsStateBackend, as my job cannot afford the latency cost of using RocksDB.
See how the checkpoint size evolves over 7 days:
Let's say I have this TTL configuration for the state in all my stateful operators: one hour (or maybe a bit more than that, and one day in one case):
public static final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupFullSnapshot()
        .build();
My understanding is that all the objects in the state should be cleaned up once their TTL expires, and therefore the checkpoint size should be reduced, since we expect more or less the same amount of data every day.
On the other hand, we have a traffic curve with more incoming data during some hours of the day; late at night traffic goes down, and all the expired objects in the state should be cleaned up, so the checkpoint size should shrink rather than stay the same until traffic picks up again.
Let's look at a code sample for one of the use cases:
DataStream<Event> stream = addSource(source);
KeyedStream<Event, String> keyedStream = stream
        .filter((FilterFunction<Event>) event -> /* apply filters here */ true)
        .name("Events filtered")
        .keyBy(k -> k.rType.equals("something") ? k.id1 : k.id2);
keyedStream.flatMap(new MyFlatMapFunction());

public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private final MapStateDescriptor<String, Event> descriptor =
            new MapStateDescriptor<>("prev_state", String.class, Event.class);
    private MapState<String, Event> previousState;

    @Override
    public void open(Configuration parameters) {
        /* ttlConfig described above */
        descriptor.enableTimeToLive(ttlConfig);
        previousState = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        final String key = event.rType.equals("something") ? event.id1 : event.id2;
        Event previous = previousState.get(key);
        if (previous != null) {
            /* something done here */
        } else {
            /* something done here */
            previousState.put(key, previous);
        }
        collector.collect(previous);
    }
}
This is more or less the structure of the use cases; some others use windows (time windows or session windows).
Questions:
What am I doing wrong here?
Are the states cleaned up when they expire, both in this scenario and in the rest of the use cases (which have the same structure)?
What can help me fix the checkpoint size if something is working incorrectly?
Is this behaviour normal?
Kind regards!
In this stretch of code it appears that you are simply writing back the state that was already there, which only serves to reset the TTL timer. This might explain why the state isn't being expired.
Event previous = previousState.get(key);
if (previous != null) {
    /* something done here */
} else {
    previousState.put(key, previous);
}
It also appears that you should be using ValueState rather than MapState. ValueState effectively provides a sharded key/value store, where the keys are the keys used to partition the stream in the keyBy. MapState gives you a nested map for each key, rather than a single value. But since you are using the same key inside the flatMap that you used to key the stream originally, key-partitioned ValueState would appear to be all that you need.
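A minimal sketch of the ValueState variant described above, assuming the same Event type and the ttlConfig from the question; what to do inside the branches is of course use-case specific:
public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private transient ValueState<Event> previousState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Event> descriptor =
                new ValueStateDescriptor<>("prev_state", Event.class);
        descriptor.enableTimeToLive(ttlConfig); // same ttlConfig as above
        previousState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        Event previous = previousState.value();
        if (previous == null) {
            // Only write the state when there is something new to remember,
            // so the TTL timer is not reset on every element.
            previousState.update(event);
        }
        collector.collect(event);
    }
}
The key point is that the state is only written when there is actually something new to store, so the TTL timer isn't refreshed on every element.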

Buffering transformed messages (for example, 1000 at a time) using Apache Flink stream processing

I'm using Apache Flink for stream processing.
After consuming messages from a source (e.g. Kafka, AWS Kinesis Data Streams) and applying transformations, aggregations, etc. with Flink operators on the streaming data, I want to buffer the final messages (e.g. 1000 of them) and post each batch in a single request to an external REST API.
How can I implement a buffering mechanism (batching every 1000 records) in Apache Flink?
Flink pipeline: streaming source --> transform/reduce using operators --> buffer 1000 messages --> post to REST API
Appreciate your help!
I'd create a sink with state that would hold on to the messages that are passed in. When the count gets high enough (1000) the sink sends the batch. The state can be in memory (e.g. an instance variable holding an ArrayList of messages), but you should use checkpoints so that you can recover that state in case of a failure of some kind.
When your sink has checkpointed state, it needs to implement CheckpointedFunction (in org.apache.flink.streaming.api.checkpoint) which means you need to add two methods to your sink:
@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
    checkpointedState.clear();

    // HttpSinkStateItem is a user-written class
    // that just holds a collection of messages (Strings, in this case)
    //
    // buffer is declared as ArrayList<String>
    checkpointedState.add(new HttpSinkStateItem(buffer));
}

@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    // Mix and match different kinds of states as needed:
    //   - Use context.getOperatorStateStore() to get basic (non-keyed) operator state
    //       - types are list and union
    //   - Use context.getKeyedStateStore() to get state for the current key
    //     (only for processing keyed streams)
    //       - types are value, list, reducing, aggregating and map
    //   - Distinguish between state data using the state name (e.g. "HttpSink-State")
    ListStateDescriptor<HttpSinkStateItem> descriptor =
            new ListStateDescriptor<>(
                    "HttpSink-State",
                    HttpSinkStateItem.class);

    checkpointedState = context.getOperatorStateStore().getListState(descriptor);

    if (context.isRestored()) {
        for (HttpSinkStateItem item : checkpointedState.get()) {
            buffer = new ArrayList<>(item.getPending());
        }
    }
}
You can also use a timer in the sink (if the input stream is keyed/partitioned) to send periodically if the count doesn't reach your threshold.
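For completeness, a rough sketch of what the buffering invoke() method of such a sink might look like; sendBatch() is a hypothetical helper that posts the buffered messages to the REST API in a single request:
@Override
public void invoke(String message, Context context) throws Exception {
    buffer.add(message);
    if (buffer.size() >= 1000) {
        sendBatch(buffer); // hypothetical helper: one HTTP POST with the whole batch
        buffer.clear();
    }
}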

Flink: how is the initial watermark set up

I am building a streaming app using Flink 1.3.2 with Scala. My Flink app monitors a folder and streams new files into the pipeline. Each record in a file has an associated timestamp. I want to use this timestamp as the event time and build watermarks using AssignerWithPeriodicWatermarks[T]. My watermark generator looks like this:
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[Activity] {
  val maxTimeLag = 6 * 3600000L // 6 hours

  override def extractTimestamp(element: Activity, previousElementTimestamp: Long): Long = {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
    format.parse(element.getTimestamp).getTime
  }

  override def getCurrentWatermark(): Watermark = {
    new Watermark(System.currentTimeMillis() - maxTimeLag)
  }
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(10000L)

val stream = env.readFile(inputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
val activity = stream
  .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator())
  .map { line =>
    new tuple.Tuple2(line.id, line.count)
  }
  .keyBy(0)
  .addSink(...)
However, since my folder has some old data in it, I don't want to process those files. The timestamps of the records in the older files are more than 6 hours old, so they should be behind the watermark. However, when I start running the job, I can still see some initial output being created. I was wondering how the initial value of the watermark is set up: is it before the first interval or after? I might be misunderstanding something here, but I need some advice.
There are no operators in the pipeline you've shown that care about time -- no windowing, no ProcessFunction timers -- so every stream element will pass through unimpeded and be processed. If your goal is to skip elements that are late, you'll need to introduce something that (somehow) compares event timestamps to the current watermark.
You could do this by introducing a step between the keyBy and sink, like this:
...
.keyBy(0)
.process(new DropLateEvents())
.addSink(...)

public static class DropLateEvents extends ProcessFunction<...> {

    @Override
    public void processElement(... event, Context context, Collector<...> out) throws Exception {
        TimerService timerService = context.timerService();
        if (context.timestamp() > timerService.currentWatermark()) {
            out.collect(event);
        }
    }
}
Having done this, your question about the initial watermark becomes relevant. With periodic watermarks, the initial watermark is Long.MIN_VALUE, so nothing will be considered late until the first watermark is emitted, which will happen after 10 seconds of operation (given how you've set the auto-watermarking interval).
The relevant code is here if you want to see how periodic watermarks are generated in more detail.
If you want to avoid processing late elements during the first 10 seconds, you could forget about using event time and watermarking entirely, and simply modify the processElement method shown above to compare the event timestamps to System.currentTimeMillis() - maxTimeLag rather than to the current watermark. Another solution would be to use punctuated watermarking, and emit a watermark with the very first event.
Or even more simply, you could detect and drop late events in a flatMap or filter, since you are defining lateness relative to System.currentTimeMillis() rather than to the watermarks.
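For the punctuated-watermarking option mentioned above, a minimal sketch (shown in Java, and assuming a hypothetical getTimestampMillis() accessor that returns the already-parsed event time in epoch milliseconds) could look like this:
public class TimeLagPunctuatedAssigner implements AssignerWithPunctuatedWatermarks<Activity> {

    private static final long MAX_TIME_LAG = 6 * 3600000L; // 6 hours

    @Override
    public long extractTimestamp(Activity element, long previousElementTimestamp) {
        return element.getTimestampMillis(); // hypothetical accessor
    }

    @Override
    public Watermark checkAndGetNextWatermark(Activity lastElement, long extractedTimestamp) {
        // A watermark is emitted with every element, including the very first one,
        // so late elements can be detected immediately instead of only after the
        // first auto-watermark interval.
        return new Watermark(System.currentTimeMillis() - MAX_TIME_LAG);
    }
}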

Non-blocking streaming on Flink

Hi, I'm trying to run a Flink job that should process incoming data as below. In the process operator right after keyBy(), there is a case that takes too much time, depending on some property of the data. Even though incoming records have different ids (which are used to keyBy() the stream), the long-running code in the process function blocks other incoming data; in fact, it blocks the entire stream.
SingleOutputStreamOperator<Envelope> processingStream = deviceStream
        .map(e -> (Envelope) e)
        .keyBy((KeySelector<Envelope, String>) value -> value.eventId) // key by scenarios
        .process(new RuleProcessFunction());
In RuleProcessFunction.java:
...
@Override
public void processElement(Envelope value, Context ctx, Collector<Envelope> out) throws Exception {
    // handleEvent(value, ctx, out);
    if (value.getEventId().equals("I")) {
        System.out.println("hello i");
        for (long i = 0; i < 10000000000L; i++) {
        }
    }
    out.collect(value);
}
I expect the long-running code block not to block the entire stream. I know there is AsyncFunction for blocking IO situations, but I don't know whether it's the correct solution for this.
Since you aren't pulling data from an external database like Cassandra, I don't think you need to use an AsyncFunction.
It could be that you are running the Flink job with a parallelism of one. Try increasing the parallelism so that one core isn't responsible for all of the processing as well as receiving data. Granted, there can still be back pressure if you do this: if the core responsible for ingesting data from the source is reading data faster than the core(s) running the ProcessFunction can handle, Flink's back-pressure handling will slow the rate of ingestion.
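A minimal sketch of raising the parallelism, either globally on the environment or on individual operators (the value of 4 here is just an example):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(4); // default parallelism for the whole job

SingleOutputStreamOperator<Envelope> processingStream = deviceStream
        .map(e -> (Envelope) e)
        .keyBy((KeySelector<Envelope, String>) value -> value.eventId)
        .process(new RuleProcessFunction())
        .setParallelism(4); // or override it just for this operator
Keep in mind that all events with the same key still go to the same subtask, so a single slow key can still delay other events for that key; higher parallelism only spreads different keys across cores.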
