Execute flink sink after tumbling window - apache-flink

Source: Kinesis data stream
Sink: Elasticsearch
Both are AWS services.
I am also running my Flink job as an AWS Kinesis Data Analytics application.
I am facing an issue with Flink's windowing function. My job looks like this:
DataStream<TrackingData> input = ...; // input from kinesis stream

input.keyBy(e -> e.getArea())
     .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction(), new MyProcessWindowFunction())
     .addSink(<elasticsearch sink>);
private static class MyReduceFunction implements ReduceFunction<TrackingData> {
    @Override
    public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
        trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
        return trackingData;
    }
}
private static class MyProcessWindowFunction
        extends ProcessWindowFunction<TrackingData, TrackingData, String, TimeWindow> {
    @Override
    public void process(String key,
                        Context context,
                        Iterable<TrackingData> in,
                        Collector<TrackingData> out) {
        TrackingData trackingIn = in.iterator().next();
        Long videoDuration = 0L;
        for (TrackingData t : in) {
            videoDuration += t.getVideoDuration();
        }
        trackingIn.setVideoDuration(videoDuration);
        out.collect(trackingIn);
    }
}
Sample event:
{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5}
What I do here: from the Kinesis stream I receive these events in large volumes. I want to sum videoDuration over every 10-second window and then store that single aggregated event in Elasticsearch.
In Kinesis there can be 10,000 events per second. I don't want to store all 10,000 events in Elasticsearch; I just want to store one event for every 10 seconds.
The issue is that when I send an event to this job, it processes the event immediately and sinks it straight into Elasticsearch. What I want instead is for the videoDuration to keep accumulating over each 10-second window, and after the 10 seconds only one event should be stored in Elasticsearch.
How can I achieve this?

I think you've misdiagnosed the problem.
The code you've written will produce one event from each 10-second-long window for each distinct key that has events during the window. MyProcessWindowFunction isn't having any effect: since the window results have been pre-aggregated, each Iterable will contain exactly one event.
I believe you want to do this instead:
input.keyBy(e -> e.getArea())
     .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .addSink(<elasticsearch sink>);
You could also just do
input.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
     .reduce(new MyReduceFunction())
     .addSink(<elasticsearch sink>);
but the first version will be faster, since it will be able to compute the per-key window results in parallel before computing the global sum in the windowAll.
FWIW, the Table/SQL API is usually a better fit for this type of application, and should produce a more optimized pipeline than either of these.
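For illustration, a rough Table/SQL sketch of the same aggregation follows. It is a minimal sketch, not a drop-in replacement: it assumes the Kinesis source has been registered as a table named tracking with a videoDuration column and a processing-time attribute proc_time (all hypothetical names).

// a rough sketch: "tracking", "videoDuration" and "proc_time" are hypothetical names for
// however the Kinesis source ends up being registered as a table
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

Table perWindowTotal = tableEnv.sqlQuery(
    "SELECT SUM(videoDuration) AS videoDuration, " +
    "       TUMBLE_END(proc_time, INTERVAL '10' SECOND) AS windowEnd " +
    "FROM tracking " +
    "GROUP BY TUMBLE(proc_time, INTERVAL '10' SECOND)");

This produces one row per 10-second window, which can then be written to the Elasticsearch sink.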

Related

Flink and Kinesis stream app for non-continuous data

We've built a Flink app to process data from a Kinesis stream. The execution flow of the app contains basic operations for filtering data based on registered types, assigning watermarks based on event timestamps, and map, process and aggregate functions applied on 5-minute windows of data, as shown below:
final SingleOutputStreamOperator<Object> inputStream = env.addSource(consumer)
    .setParallelism(..)
    .filter(..)
    .assignTimestampsAndWatermarks(..);

// Processing flow
inputStream
    .map(..)
    .keyBy(..)
    .window(..)
    .sideOutputLateData(outputTag)
    .aggregate(aggregateFunction, processWindowFunction);

// store processed data to external storage
AsyncDataStream.unorderedWait(...);
Ref code for my watermark assigner:
@Override
public void onEvent(@NonNull final MetricSegment metricSegment,
                    final long eventTimestamp,
                    @NonNull final WatermarkOutput watermarkOutput) {
    if (eventTimestamp > eventMaxTimestamp) {
        currentMaxTimestamp = Instant.now().toEpochMilli();
    }
    eventMaxTimestamp = Math.max(eventMaxTimestamp, eventTimestamp);
}

@Override
public void onPeriodicEmit(@NonNull final WatermarkOutput watermarkOutput) {
    final Instant maxEventTimestamp = Instant.ofEpochMilli(eventMaxTimestamp);
    final Duration timeElapsed = Duration.between(Instant.ofEpochMilli(lastCurrentTimestamp), Instant.now());
    if (timeElapsed.getSeconds() >= emitWatermarkIntervalSec) {
        final long watermarkTimestamp = maxEventTimestamp.plus(1, ChronoUnit.MINUTES).toEpochMilli();
        watermarkOutput.emitWatermark(new Watermark(watermarkTimestamp));
    }
}
Now, this app was working with good performance (latency on the order of a few seconds) until some time back. However, there was recently a change in the upstream system, after which data gets published to the Kinesis stream in bursts (only for 2-3 hours every day). Since this change, we have seen a huge spike in the latency of our app (measured using the Flink gauge method: recording the start time in the first filter method and then emitting the metric in the async method by computing the difference from that start timestamp). Is there any issue with using Flink apps with a Kinesis stream for bursty traffic / a non-continuous stream of data?
Since the input stream is now idle for long periods of time, this is probably creating situations where the watermarks are held up. If this is the case, then I would expect to see a lot of variance in the latency, as it would (probably) only be the final windows for each burst whose results are delayed until the arrival of the next burst.
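If you are on a Flink version with the WatermarkStrategy API (which the WatermarkOutput in your generator suggests), marking the stream as idle is one common mitigation. A minimal sketch, where MyWatermarkGenerator and getEventTimestamp() are hypothetical names standing in for your existing generator and timestamp accessor:

// a minimal sketch: MyWatermarkGenerator wraps the onEvent/onPeriodicEmit logic above and
// MetricSegment.getEventTimestamp() is a hypothetical accessor; after 1 minute without events
// the source is marked idle so it no longer holds back downstream watermark progress
WatermarkStrategy<MetricSegment> strategy = WatermarkStrategy
    .<MetricSegment>forGenerator(ctx -> new MyWatermarkGenerator())
    .withTimestampAssigner((segment, ts) -> segment.getEventTimestamp())
    .withIdleness(Duration.ofMinutes(1));

env.addSource(consumer)
    .assignTimestampsAndWatermarks(strategy);

Note that idleness only helps operators that also receive watermarks from other, non-idle inputs; it does not by itself close the final windows of a burst on a single idle stream.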

Check if I'm receiving the stream properly with all keys

I have the following scenario: suppose there are 20 sensors which are sending me a streaming feed. I apply a keyBy (sensorID) against the stream and perform some operations such as averages etc. This is implemented and running well (using the Flink Java API).
Initially it's all going well and all the sensors are sending me their feed. After a certain time, it may happen that a couple of sensors start misbehaving and I start getting an irregular feed from them, e.g. I receive a feed from 18 sensors, but 2 don't send me a feed for long durations.
We can assume that I already know the fixed list of sensorIds (possibly hard-coded / or in a database). How do I identify which two are not sending a feed? Where can I get the list of keyIds to compare with the list in the database?
I want to raise an alarm if I don't get a feed (e.g. after 2 mins, 5 mins, 10 mins etc. with increasing priority).
Has anyone implemented such a scenario using Flink streaming / patterns? Any suggestions, please?
You could use a ProcessFunction and timers.
You could simply register a timer for each record and reset it whenever you receive data. If you schedule the timer to fire after 5 minutes of processing time, this basically means that if you haven't received data by then, the onTimer callback is invoked, from which you can emit an alert. It would also be possible to re-register timers for alerts that have already fired, to allow emitting alerts with higher severity.
Note that this will only work assuming that, initially, all sensors are working correctly. Specifically, it will only emit alerts for keys that have been seen at least once. But from your description it seems that this would solve your problem.
I just happen to have an example of this pattern lying around. It'll need some adjustment to fit your use case, but should get you started.
public class TimeoutFunction extends KeyedProcessFunction<String, Event, String> {

    private ValueState<Long> lastModifiedState;
    static final int TIMEOUT = 2 * 60 * 1000; // 2 minutes

    @Override
    public void open(Configuration parameters) throws Exception {
        // register our state with the state backend
        lastModifiedState = getRuntimeContext()
            .getState(new ValueStateDescriptor<>("myState", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<String> out) throws Exception {
        // update our state and timer
        Long current = lastModifiedState.value();
        if (current != null) {
            ctx.timerService().deleteEventTimeTimer(current + TIMEOUT);
            current = Math.max(current, event.timestamp());
        } else {
            current = event.timestamp();
        }
        lastModifiedState.update(current);
        ctx.timerService().registerEventTimeTimer(current + TIMEOUT);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // emit an alert for the key whose timer fired
        String deviceId = ctx.getCurrentKey();
        out.collect(deviceId);
    }
}
This assumes a main program that does something like this:
DataStream<String> result = stream
    .assignTimestampsAndWatermarks(new MyBoundedOutOfOrdernessAssigner(...))
    .keyBy(e -> e.deviceId)
    .process(new TimeoutFunction());
As @Dominik said, this only emits alerts for keys that have been seen at least once. You could fix that by introducing a secondary source of events that creates an artificial event for every source that should exist, and union that stream with the primary source, as sketched below.
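A minimal sketch of that idea, where SyntheticEventSource and knownDeviceIds are hypothetical; the heartbeat source emits a single placeholder Event per expected deviceId when the job starts, so every key reaches TimeoutFunction at least once:

// a minimal sketch (SyntheticEventSource and knownDeviceIds are hypothetical): one artificial
// event per expected sensor guarantees a timer is registered even for sensors that never report
DataStream<Event> heartbeats = env
    .addSource(new SyntheticEventSource(knownDeviceIds))
    .assignTimestampsAndWatermarks(new MyBoundedOutOfOrdernessAssigner(...));

DataStream<String> result = stream
    .union(heartbeats)
    .keyBy(e -> e.deviceId)
    .process(new TimeoutFunction());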
The pattern is very clear to me now. I've implemented the solution and it works like a charm.
If anyone needs the code, I'll be happy to share it.

How to get the Ingestion time of an event when the time characteristic is event-time?

I want to measure the time elapsed between an event reaching the system and its processing finishing, and I think getting the ingestion time will help. But how do I get it?
You probably want to use latency tracking. Alternatively, you can add the processing time directly after the source in a chained process function (with Context->TimerService#currentProcessingTime()).
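For the latency-tracking route, a minimal sketch of enabling it programmatically (the interval is in milliseconds):

// a minimal sketch: enable Flink's built-in latency tracking so latency markers are injected
// at the sources every second and reported as per-operator latency metrics
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setLatencyTrackingInterval(1000L);

Keep in mind that latency tracking measures how long the markers take to travel through the pipeline, not the end-to-end time of individual records.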
Based on the reply from David, to get the ingest time we can chain a process function directly after the source.
The code below shows a way to get the ingest time. In case the same value needs to be used as a metric for the difference between ingest time and event time, I have used a histogram in the metric group to do that.
The snippet below might help you understand better.
DataStream<EventDataMapping> text = env
    .fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "Kafka Source")
    .process(new ProcessFunction<EventDataMapping, EventDataMapping>() {

        private transient DescriptiveStatisticsHistogram eventVsIngestionTimeLag;
        private static final int EVENT_TIME_LAG_WINDOW_SIZE = 10_000;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            eventVsIngestionTimeLag = getRuntimeContext().getMetricGroup().histogram(
                "eventVsIngestionTimeLag",
                new DescriptiveStatisticsHistogram(EVENT_TIME_LAG_WINDOW_SIZE));
        }

        @Override
        public void processElement(EventDataMapping eventDataMapping, Context context, Collector<EventDataMapping> collector) throws Exception {
            LOG.info("process element event time " + context.timestamp()
                + " current ingest time " + context.timerService().currentProcessingTime());
            eventVsIngestionTimeLag.update(context.timerService().currentProcessingTime() - context.timestamp());
        }
    }).returns(EventDataMapping.class);

How to emit the result of processing an event without delay in Flink

We are considering Flink for a use case, but we're not sure whether Flink is suitable for it. Here is my use case: when an event e1 arrives, we need to process it and emit a result. Source and sink are not relevant for this discussion, but you can think of a message queue service as both source and sink. The processing of an event is entirely independent of other events, so while processing event e1 we don't need e2 or any other event. As part of the processing, we need to do step1, step2, step3 and step4, as shown in the diagram below. Note that step2 and step3 should be done in parallel.
The processing latency of an event is critical for us, so I need to emit the result as soon as processing is complete for that element instead of waiting for some window timeout. With my limited knowledge of Flink, I could only think of the approach below:
DataStream<Map<String, Object>> step1 = env.addSource(...);
DataStream<Map<String, Object>> step2 = step1.map(...);
DataStream<Map<String, Object>> step3 = step1.map(...);
Now, how do I combine the results from step2 and step3 and emit the result? In this simple example I only have two streams to merge, but there can be more than two as well. I could do a union of the streams, and I can have a unique event id to group the outputs of the intermediate steps related to a particular event.
DataStream<Map<String, Object>> mergedStream = step1.union(step2).keyBy(...);
But how do I emit the result? Ideally, I would like to say "emit the result as soon as I get output from step2 and step3 for a specific key" instead of "emit the result every 30 millis". The latter has two problems: it may emit partial results, and it adds delay. Is there any way to specify the former?
I'm exploring Flink, but I'm open to considering other alternatives if they solve my use case.
In step 1, add an event id. Then after the union, key the stream by the event id and use a RichFlatMapFunction to combine the results of steps 2 and 3 back into a single event. If steps 2 and 3 emit events of type EnrichedEvent, then step 4 can be:
static class FanIn extends RichFlatMapFunction<EnrichedEvent, EnrichedEvent> {

    private transient ValueState<EnrichedEvent> enrichmentResponseState;

    @Override
    public void flatMap(EnrichedEvent value, Collector<EnrichedEvent> out) throws Exception {
        EnrichedEvent response = enrichmentResponseState.value();
        if (response != null) {
            response = response.combine(value);
        } else {
            response = value;
        }
        if (response.isComplete()) {
            out.collect(response);
            enrichmentResponseState.clear();
        } else {
            enrichmentResponseState.update(response);
        }
    }

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<EnrichedEvent> fanInStateDescriptor =
            new ValueStateDescriptor<>(
                "enrichmentResponse",
                TypeInformation.of(new TypeHint<EnrichedEvent>() {}));
        enrichmentResponseState = getRuntimeContext().getState(fanInStateDescriptor);
    }
}
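Wiring it together might look roughly like this (a sketch: it assumes steps 2 and 3 emit EnrichedEvent objects and that getEventId() is a hypothetical accessor for the unique event id added in step 1):

// a rough sketch; getEventId() is a hypothetical accessor for the event id added in step 1
DataStream<EnrichedEvent> merged = step2.union(step3)
    .keyBy(e -> e.getEventId())
    .flatMap(new FanIn());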
After that it's a simple matter to send the merged final result to a sink.

Flink: how to set up the initial watermark

I am building a streaming app using Flink 1.3.2 with Scala. My Flink app will monitor a folder and stream new files into the pipeline. Each record in a file has an associated timestamp. I want to use this timestamp as the event time and build watermarks using AssignerWithPeriodicWatermarks[T]. My watermark generator looks like this:
class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[Activity] {
  val maxTimeLag = 6 * 3600000L // 6 hours

  override def extractTimestamp(element: Activity, previousElementTimestamp: Long): Long = {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
    // parse the record's timestamp string into epoch millis
    format.parse(element.getTimestamp).getTime
  }

  override def getCurrentWatermark(): Watermark = {
    new Watermark(System.currentTimeMillis() - maxTimeLag)
  }
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(10000L)

val stream = env.readFile(inputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
val activity = stream
  .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator())
  .map { line =>
    new tuple.Tuple2(line.id, line.count)
  }
  .keyBy(0)
  .addSink(...)
However, since my folder has some old data in it, I don't want to process that data. The timestamps of the records in the older files are more than 6 hours old, which should be older than the watermark. However, when I start running the job, I can still see some initial output being created. I was wondering how the initial value of the watermark is set: is it before the first interval or after? I might be misunderstanding something here, but I need some advice.
There are no operators in the pipeline you've shown that care about time (no windowing, no ProcessFunction timers), so every stream element will pass through unimpeded and be processed. If your goal is to skip elements that are late, you'll need to introduce something that (somehow) compares event timestamps to the current watermark.
You could do this by introducing a step between the keyBy and sink, like this:
...
    .keyBy(0)
    .process(new DropLateEvents())
    .addSink(...)
public static class DropLateEvents extends ProcessFunction<...> {
    @Override
    public void processElement(... event, Context context, Collector<...> out) throws Exception {
        TimerService timerService = context.timerService();
        if (context.timestamp() > timerService.currentWatermark()) {
            out.collect(event);
        }
    }
}
Having done this, your question about the initial watermark becomes relevant. With periodic watermarks, the initial watermark is Long.MIN_VALUE, so nothing will be considered late until the first watermark is emitted, which will happen after 10 seconds of operation (given how you've set the auto-watermarking interval).
The relevant code is here if you want to see how periodic watermarks are generated in more detail.
If you want to avoid processing late elements during the first 10 seconds, you could forget about using event time and watermarking entirely, and instead modify the processElement method shown above to compare event timestamps to System.currentTimeMillis() - maxTimeLag rather than to the current watermark. Another solution would be to use punctuated watermarking and emit a watermark with the very first event, as sketched below.
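A minimal sketch of the punctuated variant, assuming Activity exposes a getTimestampMillis() accessor (hypothetical); since a watermark is emitted with every element, lateness is enforced from the very first event:

// a minimal sketch (getTimestampMillis() is a hypothetical accessor on Activity)
public static class PunctuatedLagAssigner implements AssignerWithPunctuatedWatermarks<Activity> {
    private static final long MAX_TIME_LAG = 6 * 3600000L; // 6 hours

    @Override
    public long extractTimestamp(Activity element, long previousElementTimestamp) {
        return element.getTimestampMillis();
    }

    @Override
    public Watermark checkAndGetNextWatermark(Activity lastElement, long extractedTimestamp) {
        // emit a watermark with every element, lagging 6 hours behind the wall clock
        return new Watermark(System.currentTimeMillis() - MAX_TIME_LAG);
    }
}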
Or even more simply, you could detect and drop late events in a flatMap or filter, since you are defining lateness relative to System.currentTimeMillis() rather than to the watermarks.
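A minimal sketch of that filter-based approach (again assuming the hypothetical getTimestampMillis() accessor):

// a minimal sketch: drop anything older than maxTimeLag relative to the wall clock,
// without relying on event time or watermarks at all
final long maxTimeLag = 6 * 3600000L; // 6 hours

DataStream<Activity> recentOnly = stream
    .filter(a -> a.getTimestampMillis() >= System.currentTimeMillis() - maxTimeLag);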
