We are considering Flink for a use case, but we are not sure whether it is suitable. Here is the use case. When an event e1 arrives, we need to process it and emit a result. The source and sink are not relevant for this discussion, but you can think of a message queue service as both. The processing of each event is entirely independent of other events, so while processing e1 we don't need e2 or any other event. As part of the processing, we need to do step1, step2, step3, and step4 as shown in the diagram below. Note that step2 and step3 should run in parallel.
The processing latency of an event is critical for us, so I need to emit the result as soon as processing is complete for that event instead of waiting for a window timeout. With my limited knowledge of Flink, I could only think of the approach below:
DataStream<Map<String, Object>> step1 = env.addSource(...);
DataStream<Map<String, Object>> step2 = step1.map(...);
DataStream<Map<String, Object>> step3 = step1.map(...);
Now, how do I combine the results from step2 and step3 and emit the result? In this simple example I only have two streams to merge, but there can be more than two. I could do a union of the streams, and I can use a unique event id to group the outputs of the intermediate steps that belong to a particular event.
DataStream<Map<String, Object>> mergedStream = step2.union(step3).keyBy(...);
But how do I emit the result? Ideally, I would like to say "emit the result as soon as I get output from step2 and step3 for a specific key" instead of "emit the result every 30 millis". The latter has two problems: it may emit partial results, and it adds delay. Is there any way to specify the former?
I'm exploring Flink, but I'm open to considering alternatives if they solve my use case.
In step 1, add an event id. Then after the union, key the stream by the event id and use a RichFlatMapFunction to combine the results of steps 2 and 3 back into a single event. If steps 2 and 3 emit events of type EnrichedEvent, then step 4 can be:
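import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

static class FanIn extends RichFlatMapFunction<EnrichedEvent, EnrichedEvent> {
    // Partial result for the current key (the event id), kept until all steps have reported.
    private transient ValueState<EnrichedEvent> enrichmentResponseState;

    @Override
    public void flatMap(EnrichedEvent value, Collector<EnrichedEvent> out) throws Exception {
        EnrichedEvent response = enrichmentResponseState.value();
        if (response != null) {
            response = response.combine(value);
        } else {
            response = value;
        }
        if (response.isComplete()) {
            // All parts have arrived: emit immediately and free the state for this key.
            out.collect(response);
            enrichmentResponseState.clear();
        } else {
            enrichmentResponseState.update(response);
        }
    }

    @Override
    public void open(Configuration config) {
        ValueStateDescriptor<EnrichedEvent> fanInStateDescriptor =
            new ValueStateDescriptor<>("enrichmentResponse",
                TypeInformation.of(new TypeHint<EnrichedEvent>() {})
            );
        enrichmentResponseState = getRuntimeContext().getState(fanInStateDescriptor);
    }
}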
After that it's a simple matter to send the merged final result to a sink.
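The FanIn function relies on EnrichedEvent providing combine() and isComplete(), which are not shown. As a rough sketch only, here is one way such a class could look in plain Java. The field names, the partial() factory and the EXPECTED_STEPS set are all assumptions made for illustration, not part of the original answer; the idea is that each parallel step contributes one named result and the event is complete once every expected step has reported.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical shape for EnrichedEvent; field names and EXPECTED_STEPS are assumptions.
public class EnrichedEvent {
    // Names of the parallel steps whose results must be present (assumed).
    static final Set<String> EXPECTED_STEPS = Set.of("step2", "step3");

    public final String eventId;
    public final Map<String, Object> results = new HashMap<>();

    public EnrichedEvent(String eventId) {
        this.eventId = eventId;
    }

    // Convenience factory for the partial result produced by a single step.
    public static EnrichedEvent partial(String eventId, String step, Object result) {
        EnrichedEvent e = new EnrichedEvent(eventId);
        e.results.put(step, result);
        return e;
    }

    // Merge another partial result for the same event id into a new event.
    public EnrichedEvent combine(EnrichedEvent other) {
        if (!eventId.equals(other.eventId)) {
            throw new IllegalArgumentException("event ids differ");
        }
        EnrichedEvent merged = new EnrichedEvent(eventId);
        merged.results.putAll(results);
        merged.results.putAll(other.results);
        return merged;
    }

    // Complete once every expected step has contributed a result.
    public boolean isComplete() {
        return results.keySet().containsAll(EXPECTED_STEPS);
    }
}
```

With this shape, fanning in more than two parallel steps would only require adding their names to EXPECTED_STEPS. A class used in Flink state would also need to be serializable by Flink, which this sketch does not address.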
Source: Kinesis data stream
Sink: Elasticsearch
Both are AWS services.
I am also running my Flink job as an AWS Kinesis Data Analytics application.
I am facing an issue with Flink's windowing function. My job looks like this:
DataStream<TrackingData> input = ...; // input from kinesis stream
input.keyBy(e -> e.getArea())
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.reduce(new MyReduceFunction(), new MyProcessWindowFunction())
.addSink(<elasticsearch sink>);
private static class MyReduceFunction implements ReduceFunction<TrackingData> {
    @Override
    public TrackingData reduce(TrackingData trackingData, TrackingData t1) throws Exception {
        trackingData.setVideoDuration(trackingData.getVideoDuration() + t1.getVideoDuration());
        return trackingData;
    }
}
private static class MyProcessWindowFunction extends ProcessWindowFunction<TrackingData, TrackingData, String, TimeWindow> {
    @Override
    public void process(String key,
                        Context context,
                        Iterable<TrackingData> in,
                        Collector<TrackingData> out) {
        TrackingData trackingIn = in.iterator().next();
        Long videoDuration = 0L;
        for (TrackingData t : in) {
            videoDuration += t.getVideoDuration();
        }
        trackingIn.setVideoDuration(videoDuration);
        out.collect(trackingIn);
    }
}
sample event :
{"area":"sessions","userId":4450,"date":"2021-12-03T11:00:00","videoDuration":5}
I receive a large number of these events from the Kinesis stream. I want to sum videoDuration over each 10-second window and then store a single event in Elasticsearch.
Kinesis can deliver 10,000 events per second. I don't want to store all 10,000 events in Elasticsearch; I want to store only one event every 10 seconds.
The issue is that when I send an event to this job, it is processed immediately and written straight to Elasticsearch. What I want is for videoDuration to keep accumulating for 10 seconds, and only then for a single event to be stored in Elasticsearch.
How can I achieve this?
I think you've misdiagnosed the problem.
The code you've written will produce one event from each 10-second-long window for each distinct key that has events during the window. MyProcessWindowFunction isn't having any effect: since the window results have been pre-aggregated, each Iterable will contain exactly one event.
I believe you want to do this instead:
input.keyBy(e -> e.getArea())
.window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.reduce(new MyReduceFunction())
.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.reduce(new MyReduceFunction())
.addSink(<elasticsearch sink>);
You could also just do
input.windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))
.reduce(new MyReduceFunction())
.addSink(<elasticsearch sink>);
but the first version will be faster, since it will be able to compute the per-key window results in parallel before computing the global sum in the windowAll.
FWIW, the Table/SQL API is usually a better fit for this type of application, and should produce a more optimized pipeline than either of these.
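To make the "one result per 10-second window" behaviour concrete, here is a small plain-Java model of a tumbling-window sum. It is a sketch for illustration only and does not use Flink: the class and method names are made up, timestamps are assumed to be non-negative epoch milliseconds, and windows are assumed to start at multiples of the window size.

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a 10-second tumbling-window sum (illustrative only, not Flink code).
public class TumblingSum {
    static final long WINDOW_MS = 10_000L;

    // Start of the window that a (non-negative) timestamp falls into.
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_MS);
    }

    // Sums durations per window; each input row is {timestampMs, videoDuration}.
    static Map<Long, Long> sumPerWindow(long[][] events) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            sums.merge(windowStart(e[0]), e[1], Long::sum);
        }
        return sums;
    }
}
```

However many events fall into a window, the map ends up with a single entry for it, which corresponds to the single record the windowAll pipeline would hand to the sink.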
In addition to this question, I still don't understand why the checkpoints of my Flink job keep growing over time. After about 7 days of running, their size has still not reached a plateau.
I'm currently using Flink 1.10 with the FS state backend, as my job cannot afford the latency cost of RocksDB.
See the checkpoints evolve over 7 days:
Let's say I configure a TTL for the state in all my stateful operators: one hour in most cases, sometimes more, and a day in one case:
public static final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupFullSnapshot().build();
My understanding is that all objects in state are cleaned up after they expire, so the checkpoint size should shrink, given that we expect more or less the same amount of data every day.
On the other hand, our traffic follows a curve: some hours of the day have more incoming data, and late at night traffic drops. At that point the expired objects in state should be cleaned up, so the checkpoint size should decrease rather than stay the same until traffic picks up again.
Let's see this code sample of one use case:
DataStream<Event> stream = addSource(source);
KeyedStream<Event, String> keyedStream = stream.filter((FilterFunction<Event>) event ->
        /* apply filters here */)
    .name("Events filtered")
    .keyBy(k -> k.rType.equals("something") ? k.id1 : k.id2);
keyedStream.flatMap(new MyFlatMapFunction());
public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {
    private final MapStateDescriptor<String, Event> descriptor = new MapStateDescriptor<>("prev_state", String.class, Event.class);
    private MapState<String, Event> previousState;

    @Override
    public void open(Configuration parameters) {
        /*ttlConfig described above*/
        descriptor.enableTimeToLive(ttlConfig);
        previousState = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        final String key = event.rType.equals("something") ? event.id1 : event.id2;
        Event previous = previousState.get(key);
        if (previous != null) {
            /*something done here*/
        } else /*something done here*/
            previousState.put(key, previous);
        collector.collect(previous);
    }
}
This is more or less the structure of our use cases; some others use windows (time windows or session windows).
Questions:
What am I doing wrong here?
Is state cleaned up when it expires in this scenario, which is the same as in the rest of the use cases?
What can I do to fix the checkpoint size if something is working incorrectly?
Is this behaviour normal?
Kind regards!
In this stretch of code it appears that you are simply writing back the state that was already there, which only serves to reset the TTL timer. This might explain why the state isn't being expired.
Event previous = previousState.get(key);
if (previous != null) {
/*something done here*/
} else
previousState.put(key, previous);
It also appears that you should be using ValueState rather than MapState. ValueState effectively provides a sharded key/value store, where the keys are the keys used to partition the stream in the keyBy. MapState gives you a nested map for each key, rather than a single value. But since you are using the same key inside the flatMap that you used to key the stream originally, key-partitioned ValueState would appear to be all that you need.
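The effect of writing a value back can be illustrated with a toy model of a TTL store in which every write refreshes the entry's timestamp, as the OnCreateAndWrite update type implies. This is a plain-Java sketch for illustration only; TtlStore and its methods are hypothetical and do not reflect how Flink implements state TTL internally.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of state TTL with OnCreateAndWrite semantics (illustrative only, not Flink's implementation).
public class TtlStore {
    private final long ttlMs;
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> lastWrite = new HashMap<>();

    public TtlStore(long ttlMs) {
        this.ttlMs = ttlMs;
    }

    // Every write, including writing back an unchanged value, refreshes the timestamp.
    public void put(String key, String value, long nowMs) {
        values.put(key, value);
        lastWrite.put(key, nowMs);
    }

    // Reads never refresh the timestamp; expired entries are not returned.
    public String get(String key, long nowMs) {
        Long written = lastWrite.get(key);
        if (written == null || nowMs - written >= ttlMs) {
            return null;
        }
        return values.get(key);
    }
}
```

In this model, an entry written at time 0 with a TTL of 100 is gone at time 120, but if the same value is written back at time 50 it is still visible at 120. A key that keeps receiving events and keeps being rewritten would therefore never expire.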
I have the following scenario: suppose there are 20 sensors which are sending me streaming feed. I apply a keyBy (sensorID) against the stream and perform some operations such as average etc. This is implemented, and running well (using Flink Java API).
Initially everything goes well and all the sensors send me their feed. After a certain time, a couple of sensors may start misbehaving and I start getting an irregular feed from them, e.g. I receive a feed from 18 sensors, but 2 don't send anything for long durations.
We can assume that I already know the fixed list of sensor IDs (possibly hard-coded or stored in a database). How do I identify which two are not sending a feed? Where can I get the list of key IDs to compare with the list in the database?
I want to raise an alarm if I don't get a feed (e.g 2 mins, 5 mins, 10 mins etc. with increasing priority).
Has anyone implemented such a scenario using flink-streaming / patterns? Any suggestions please.
You could use a ProcessFunction with timers.
You could register a timer for each record and reset it whenever you receive data. If you schedule the timer to fire after 5 minutes of processing time, then onTimer will be called whenever no data has arrived for that long, and from there you can emit an alert. You could also re-register timers for alerts that have already fired, which would let you emit alerts with higher severity.
Note that this only works if all sensors are initially working correctly. Specifically, it will only emit alerts for keys that have been seen at least once. From your description, though, it seems this would solve your problem.
I just happen to have an example of this pattern lying around. It'll need some adjustment to fit your use case, but should get you started.
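import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class TimeoutFunction extends KeyedProcessFunction<String, Event, String> {
    // Timestamp of the most recent event seen for the current key.
    private ValueState<Long> lastModifiedState;
    static final int TIMEOUT = 2 * 60 * 1000; // 2 minutes

    @Override
    public void open(Configuration parameters) throws Exception {
        // register our state with the state backend
        lastModifiedState = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<String> out) throws Exception {
        // update our state and timer
        Long current = lastModifiedState.value();
        if (current != null) {
            // cancel the timer that was registered for the previous event
            ctx.timerService().deleteEventTimeTimer(current + TIMEOUT);
        }
        // the state is empty for the first event of a key, so guard against null
        current = (current == null) ? event.timestamp() : Math.max(current, event.timestamp());
        lastModifiedState.update(current);
        ctx.timerService().registerEventTimeTimer(current + TIMEOUT);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
        // emit alert
        String deviceId = ctx.getCurrentKey();
        out.collect(deviceId);
    }
}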
This assumes a main program that does something like this:
DataStream<String> result = stream
.assignTimestampsAndWatermarks(new MyBoundedOutOfOrdernessAssigner(...))
.keyBy(e -> e.deviceId)
.process(new TimeoutFunction());
As @Dominik said, this only emits alerts for keys that have been seen at least once. You could fix that by introducing a secondary source of events that creates an artificial event for every source that should exist, and union that stream with the primary source.
The pattern is very clear to me now. I've implemented the solution and it works like a charm.
If anyone needs the code, I'll be happy to share it.
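The question asks for alarms of increasing priority after 2, 5 and 10 minutes of silence, and re-registering timers after an alert fires is one way to get there. The helper below is a plain-Java sketch of the escalation schedule only. The thresholds come from the question, while the class, the method names and the idea of tracking a per-key alert level are assumptions made for illustration.

```java
// Escalation schedule for silent sensors (thresholds from the question; the rest is illustrative).
public class Escalation {
    // Silence thresholds in milliseconds: 2, 5 and 10 minutes.
    static final long[] THRESHOLDS_MS = {2 * 60_000L, 5 * 60_000L, 10 * 60_000L};

    // Severity level (1-based) reached after the given silence, or 0 if no alert is due yet.
    static int severity(long silenceMs) {
        int level = 0;
        for (long t : THRESHOLDS_MS) {
            if (silenceMs >= t) {
                level++;
            }
        }
        return level;
    }

    // Time of the next timer to register once `level` alerts have fired, or -1 if none remain.
    static long nextTimer(long lastSeenMs, int level) {
        return level < THRESHOLDS_MS.length ? lastSeenMs + THRESHOLDS_MS[level] : -1L;
    }
}
```

One possible way to use it: keep the number of alerts already fired for a key in keyed state, reset it to 0 whenever an event arrives, and in onTimer emit an alert with the next severity and register the timer returned by nextTimer, if any.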
How do I assign an id to a session window in Apache Flink?
Ultimately I want to enrich events with a session window id one by one while the session window is open (I don't want to wait until the window closes before emitting the enriched events).
I tried to do this with an AggregateFunction, but I don't think merge() works as I expected. It seems to be for merging windows, not panes (trigger firings), and it never seems to be called in my pipeline. It therefore seems that there is no shared state between trigger firings!
The session window id will be the timestamp of the first event to fall into the window (because ordering is not guaranteed, some events with an earlier timestamp could potentially fall into the same session window; I'm OK with this).
public class FooSessionState {
    private Long sessionCreationTime;
    private FooMatch lastMatch;
}

/**
 * Aggregator that assigns session ids to elements of a session window
 */
public class SessionIdAssigner implements
        AggregateFunction<FooMatch, FooSessionState, FooSessionEvent> {
    static final long serialVersionUID = 0L;

    @Override
    public FooSessionState createAccumulator() {
        return new FooSessionState();
    }

    @Override
    public FooSessionState add(FooMatch value, FooSessionState sessionState) {
        if (sessionState.getSessionCreationTime() == null) {
            sessionState.setSessionCreationTime(value.getReport().getTimestamp());
        }
        sessionState.setLastMatch(value);
        return sessionState;
    }

    @Override
    public FooSessionEvent getResult(FooSessionState accumulator) {
        FooSessionEvent sessionEvent = new FooSessionEvent();
        sessionEvent.setFooMatch(accumulator.getLastMatch());
        sessionEvent.setSessionCreationTime(accumulator.getSessionCreationTime());
        return sessionEvent;
    }

    @Override
    public FooSessionState merge(FooSessionState a, FooSessionState b) {
        if (a.getSessionCreationTime() != null) {
            b.setSessionCreationTime(a.getSessionCreationTime());
        }
        return b;
    }
}
My plan was to use it as follows:
stream.keyBy(new FooMatchKeySelector())
.window(EventTimeSessionWindows.withGap(Time.milliseconds(config.getFooSessionWindowTimeout())))
.trigger(PurgingTrigger.of(CountTrigger.of(1L)))
.aggregate(new SessionIdAssigner())
I think session windows are not a good fit for what you want to achieve. They have been designed to aggregate events per session, not to enrich every event, i.e., they compute a result and emit it when the window is closed. As you noticed, session windows work by creating a new window for every event and merging windows that overlap. This design was chosen because events might arrive out of order. Hence, it might happen that you have two windows that are later connected by a bridging event.
I would recommend implementing the logic with a ProcessFunction that collects the events and sorts them by timestamp. When a watermark is received, it emits all collected events with the correct session IDs. Hence, you keep only the events between two watermarks in state. In addition to those events, you need to keep the timestamp of the last emitted event and the last emitted session ID to perform correct sessionization.
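The core of that approach, sorting the buffered events and assigning session IDs while carrying over the last emitted timestamp and session ID, can be sketched in plain Java as follows. This is only an illustration of the bookkeeping, not Flink code: the class name, the method signature and the use of Long.MIN_VALUE to mean "no previous event" are assumptions. In line with the question, the session ID is the timestamp of the first event in the session.

```java
import java.util.Arrays;

// Plain-Java sketch of gap-based session-id assignment (illustrative only, not Flink code).
public class Sessionizer {
    // Returns rows of {timestamp, sessionId} in timestamp order.
    // lastTs and lastSessionId carry over from previously emitted events;
    // pass Long.MIN_VALUE as lastTs if nothing has been emitted yet.
    static long[][] assign(long[] timestamps, long gapMs, long lastTs, long lastSessionId) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        long[][] out = new long[sorted.length][2];
        for (int i = 0; i < sorted.length; i++) {
            long ts = sorted[i];
            // Start a new session if there is no previous event or the gap has been exceeded.
            if (lastTs == Long.MIN_VALUE || ts - lastTs > gapMs) {
                lastSessionId = ts;
            }
            lastTs = ts;
            out[i][0] = ts;
            out[i][1] = lastSessionId;
        }
        return out;
    }
}
```

In a ProcessFunction, the input array would correspond to the events buffered since the previous watermark, and the last row of the result would supply the values to store for the next round.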
I am building a streaming app using Flink 1.3.2 with Scala. My Flink app will monitor a folder and stream new files into the pipeline. Each record in a file has an associated timestamp. I want to use this timestamp as the event time and generate watermarks using AssignerWithPeriodicWatermarks[T]. My watermark generator looks like this:
import java.text.SimpleDateFormat

class TimeLagWatermarkGenerator extends AssignerWithPeriodicWatermarks[Activity] {
  val maxTimeLag = 6 * 3600000L // 6 hours
  override def extractTimestamp(element: Activity, previousElementTimestamp: Long): Long = {
    val format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssXXX")
    val timestampString = element.getTimestamp
    // Parse the record's timestamp string into epoch milliseconds.
    format.parse(timestampString).getTime
  }
  override def getCurrentWatermark(): Watermark = {
    // The watermark trails the wall clock by maxTimeLag.
    new Watermark(System.currentTimeMillis() - maxTimeLag)
  }
}
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
env.getConfig.setAutoWatermarkInterval(10000L)
val stream = env.readFile(inputformart, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
val activity = stream
.assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator())
.map { line =>
new tuple.Tuple2(line.id, line.count)
}.keyBy(0).addSink(...)
However, my folder contains some old data that I don't want to process. The timestamps of records in the older files are more than 6 hours old, so they should be older than the watermark. Yet when I start the job, I still see some initial output being created. How is the initial value of the watermark set: before the first interval or after it? I may be misunderstanding something here and would appreciate some advice.
There are no operators in the pipeline you've shown that care about time -- no windowing, no ProcessFunction timers -- so every stream element will pass through unimpeded and be processed. If your goal is to skip elements that are late you'll need to introduce something that (somehow) compares event timestamps to the current watermark.
You could do this by introducing a step between the keyBy and sink, like this:
...
.keyBy(0)
.process(new DropLateEvents())
.addSink(...)
public static class DropLateEvents extends ProcessFunction<...> {
    @Override
    public void processElement(... event, Context context, Collector<...> out) throws Exception {
        TimerService timerService = context.timerService();
        if (context.timestamp() > timerService.currentWatermark()) {
            out.collect(event);
        }
    }
}
Having done this, your question about the initial watermark becomes relevant. With periodic watermarks, the initial watermark is Long.MIN_VALUE, so nothing will be considered late until the first watermark is emitted, which will happen after 10 seconds of operation (given how you've set the auto-watermarking interval).
The relevant code is here if you want to see how periodic watermarks are generated in more detail.
If you want to avoid processing late elements during the first 10 seconds, you could simply forget about using event time and watermarking entirely, and simply modify the processElement method shown above to compare the event timestamps to System.currentTimeMillis() - maxTimeLag rather than to the current watermark. Another solution would be to use punctuated watermarking, and emit a watermark with the very first event.
Or even more simply, you could detect and drop late events in a flatMap or filter, since you are defining lateness relative to System.currentTimeMillis() rather than to the watermarks.
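As a rough sketch of that last option, the lateness test itself is a one-line comparison against the wall clock. The helper below is plain Java for illustration; the class and method names are made up, and the 6-hour lag simply mirrors maxTimeLag from the question.

```java
// Wall-clock lateness check (illustrative; the 6-hour lag mirrors maxTimeLag in the question).
public class LatenessFilter {
    static final long MAX_TIME_LAG_MS = 6 * 3_600_000L;

    // True if the event is recent enough to keep, judged against the supplied wall-clock time.
    static boolean isOnTime(long eventTimestampMs, long nowMs) {
        return eventTimestampMs > nowMs - MAX_TIME_LAG_MS;
    }
}
```

Inside a filter, this would be called with the record's parsed timestamp and System.currentTimeMillis(). Passing the current time in as a parameter keeps the check easy to test.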