Too many timers cost too much time when checkpointing in Flink - apache-flink

I have a situation where I need to do a sliding count over a large volume of messages using state and the timer service. The slide size is one event and the window size is more than 10 hours. The problem I'm facing is that checkpointing takes a lot of time. To improve performance we enabled incremental checkpoints, but checkpointing is still slow. We found that most of the time is spent serializing the timers that are used to clean up old data. We have one timer per key, and there are about 300M timers in total.
Any suggestions for solving this problem would be appreciated. Or is there another way to do the count?
————————————————————————————————————————————
I'd like to add some details to the situation. The slide size is one event and the window size is more than 10 hours (there are about 300 events per second), and we need to react to each event. So in this situation we did not use the windows provided by Flink; we use keyed state to store the previous information instead. The timers are used in a ProcessFunction to trigger cleanup of the old data. Finally, the number of distinct keys is very large.

I think this should work:
Dramatically reduce the number of keys Flink is working with, from 300M down to 100K (for example), by effectively doing something like keyBy(key mod 100000). Your ProcessFunction can then use a MapState (where the map keys are the original keys) to store whatever it needs.
MapState has iterators, which you can use to periodically crawl each of these maps and expire old items. Stick to the principle of having only one timer per key (per uber-key, if you will), so that you only have 100K timers.
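A minimal sketch of that idea, assuming an Event type with getKey()/getTimestamp() accessors; the class name, TTL, and sweep interval are made up for illustration:

import java.util.Iterator;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch: ~100K "uber-keys", each holding a MapState of original keys and
// a single sweep timer, instead of one timer per original key.
public class UberKeyCleaner extends KeyedProcessFunction<Long, Event, Event> {

    private static final long TTL_MS = 10L * 60 * 60 * 1000; // ~10h window (assumed)
    private static final long SWEEP_INTERVAL_MS = 60_000;    // sweep once per minute

    // original key -> timestamp of the latest event for that key
    private transient MapState<String, Long> lastSeen;

    @Override
    public void open(Configuration parameters) {
        lastSeen = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("lastSeen", String.class, Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        if (lastSeen.isEmpty()) {
            // only one timer per uber-key
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + SWEEP_INTERVAL_MS);
        }
        lastSeen.put(event.getKey(), event.getTimestamp());
        out.collect(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) throws Exception {
        // crawl the map and expire old entries
        Iterator<Map.Entry<String, Long>> it = lastSeen.iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < timestamp - TTL_MS) {
                it.remove();
            }
        }
        if (!lastSeen.isEmpty()) {
            // refresh the timer while there is still data to expire
            ctx.timerService().registerProcessingTimeTimer(timestamp + SWEEP_INTERVAL_MS);
        }
    }
}

Usage would look something like stream.keyBy(e -> (long) Math.floorMod(e.getKey().hashCode(), 100_000)).process(new UberKeyCleaner()).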
UPDATE:
Flink 1.6 included FLINK-9485, which allows timers to be checkpointed asynchronously, and to be stored in RocksDB. This makes it much more practical for Flink applications to have large numbers of timers.
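For the RocksDB state backend, where the timers live is controlled by a flink-conf.yaml option along these lines (exact availability and default depend on the Flink version; in recent releases RocksDB is already the default):

state.backend.rocksdb.timer-service.factory: ROCKSDB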

What if, instead of using timers, you add an extra field to every element of your stream to store the current processing time or the arrival time? Then, when you want to clean old data from your stream, you just use a filter operator and check whether the data is old enough to be deleted.
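A rough sketch of that approach (Event, the arrival-time accessors, and MAX_AGE_MS are assumptions):

// Stamp each element with its arrival time, then drop old elements with a
// filter instead of per-key timers.
DataStream<Event> stamped = input.map(e -> {
    e.setArrivalTime(System.currentTimeMillis());
    return e;
});

DataStream<Event> fresh = stamped.filter(
    e -> System.currentTimeMillis() - e.getArrivalTime() < MAX_AGE_MS);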

Rather than registering a clearing timer on each event, how about registering a timer only once per period, e.g. once per minute? You could register it only the first time a key is seen, and then refresh it in onTimer. Something like:
new ProcessFunction<SongEvent, Object>() {

    ...

    @Override
    public void processElement(
            SongEvent songEvent,
            Context context,
            Collector<Object> collector) throws Exception {
        Boolean isTimerRegistered = state.value();
        if (isTimerRegistered == null || !isTimerRegistered) {
            // register the cleanup timer only the first time this key is seen
            long time = context.timerService().currentProcessingTime() + 60_000; // e.g. one minute
            context.timerService().registerProcessingTimeTimer(time);
            state.update(true);
        }
        // Standard processing
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Object> out)
            throws Exception {
        pruneElements(timestamp);
        if (!elements.isEmpty()) {
            // refresh the timer while there is still data to expire
            long time = ctx.timerService().currentProcessingTime() + 60_000;
            ctx.timerService().registerProcessingTimeTimer(time);
        } else {
            state.clear();
        }
    }
}
Something similar is implemented for the Flink SQL OVER clause; you can have a look at that implementation for reference.

Related

FLINK forBoundedOutOfOrderness + CEP

I'm trying to implement a CEP pattern in Flink on an out-of-order stream of events.
My stream is built this way:
DataStream<DataInput> input = inputStream.flatMap(
    new FlatMapFunction<String, DataInput>() {
        @Override
        public void flatMap(String value, Collector<DataInput> out) throws Exception {
            for (DataInput input : JsonUtilsJackson.getInstance().initTrackingDataFromJson(value)) {
                // One input can generate multiple DataInput
                out.collect(input);
            }
        }
    })
    // Elements can be sent late
    .assignTimestampsAndWatermarks(WatermarkStrategy.<DataInput>forBoundedOutOfOrderness(Duration.ofSeconds(10))
        // Timestamp is not based on Kinesis but on the data timestamp
        .withTimestampAssigner((event, timestamp) -> event.getGeneratedDate().toEpochSecond()))
    // CEP by key
    .keyBy(requestId -> requestId.getTrackingData().getEntityReference());
And my pattern is applied to the stream with the code below:
SingleOutputStreamOperator<DataOutput> enterStream = CEP.pattern(
input,
PatternStrategy.getPattern()
).process(new SpecificProcess());
My understanding of forBoundedOutOfOrderness is that if an element is injected at 11:01:00 with a generatedDate field of 10:00:00, it will accept all elements with a generatedDate between 09:59:50 and 10:00:00 and sort them in ascending order.
The thing I don't understand is how the periodic advancement of the watermark is managed. Since the watermark does not depend on the Kinesis read time (11:01:00 in my example), how does Flink decide that it no longer has to wait? Is that linked to periodic watermark generation plus the out-of-orderness bound?
During my tests, the pattern fires only once and never again.
By debugging I can see in CepOperator.onEventTime that events are buffered correctly, but their timestamps are always <= timerService.currentWatermark().
So, if someone has an explanation, it would help me. Thanks.
By the way, is there a way to have a watermark per KeyedStream? My different entities don't have the same lifetime, and I miss some events.
Your question isn't entirely clear, but perhaps the information below will help you.
The role that watermarks play is that they sit at a particular spot in the stream and mark that spot with a timestamp indicating completeness -- at that spot in the stream, no further events are expected with timestamps less than the one in the watermark.
Watermarks don't sort the stream, but they can be used for sorting. This is what CEP does when it is used in event time mode.
forBoundedOutOfOrderness is a watermark strategy that produces watermarks periodically (by default, every 200 msec). But the watermark will only advance if there have been new events since the last watermark that can be used as justification for a larger watermark (i.e., at least one event with a larger timestamp).
Flink does not support per-key watermarking. But the FlinkKinesisConsumer supports per-shard watermarking, which may help. This causes the shards with the most lag to hold back the watermark, which avoids having so many late events. And if you use a separate shard for each key, then you will have something similar to per-key watermarking, as sketched below.
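A hedged sketch of the per-shard setup using the legacy FlinkKinesisConsumer API (the stream name, deserialization schema, and consumer properties are assumptions; DataInput and getGeneratedDate() come from the question):

FlinkKinesisConsumer<DataInput> consumer = new FlinkKinesisConsumer<>(
        "my-stream", deserializationSchema, consumerConfig); // names assumed

// Attach the assigner to the consumer itself, so each Kinesis shard gets
// its own watermark, instead of calling assignTimestampsAndWatermarks()
// on the resulting stream.
consumer.setPeriodicWatermarkAssigner(
        new BoundedOutOfOrdernessTimestampExtractor<DataInput>(Time.seconds(10)) {
            @Override
            public long extractTimestamp(DataInput element) {
                // convert seconds to the milliseconds Flink expects
                return element.getGeneratedDate().toEpochSecond() * 1000;
            }
        });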

Why are my Flink windows using so much state?

The checkpoints for my Flink job are getting larger and larger. After drilling down into individual tasks, the keyed window function seems to be responsible for most of the size. How can I reduce this?
If you have a lot of state tied up in windows, there are several possibilities:
Using incremental aggregation (by using reduce or aggregate) can dramatically reduce your storage requirements; otherwise each event is copied into the list of events assigned to each window. (See the sketch after this list.)
If you are aggregating over multiple timeframes, e.g., every minute and every 10 minutes, you can cascade these windows, so that the 10 minute windows are only consuming the output of the minute-long windows, rather than every event.
If you are using sliding windows, each event is being assigned to each of the overlapping windows. For example, if your windows are 2 minutes long and sliding by 1 second, each event is being copied into 120 windows. Incremental and/or pre-aggregation will help here (a lot!), or you may want to use a KeyedProcessFunction instead of a window in order to optimize your state footprint.
If you have keyed count windows, you could have keys for which the requisite batch size is never (or only very slowly) reached, leading to more and more partial batches sitting around in state. You could implement a custom Trigger that incorporates a timeout in addition to the count-based triggering so that these partial batches are eventually processed.
If you are using globalState in a ProcessWindowFunction, the globalState for stale keys will accumulate. You can use state TTL on the state descriptor for the globalState. Note: this is the only place where window state isn't automatically freed when windows are cleared.
Or it may simply be that your key space is growing over time, and there's really nothing that can be done except to scale up the cluster.
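As an illustration of the first point, here is a hedged sketch of incremental aggregation, where each window keeps a single Long counter per key rather than a list of events (Event and getKey() are assumptions):

DataStream<Long> counts = events
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new AggregateFunction<Event, Long, Long>() {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Event value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    });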

Some questions related to Fraud detection demo from Flink DataStream API

The example is very useful at first; it illustrates how KeyedProcessFunction works in Flink.
But there is something worth noticing that suddenly came to me...
It is from the Fraud Detector v2: State + Time part.
Setting a timer here is reasonable, given the real application requirements.
override def onTimer(
    timestamp: Long,
    ctx: KeyedProcessFunction[Long, Transaction, Alert]#OnTimerContext,
    out: Collector[Alert]): Unit = {
  // remove flag after 1 minute
  timerState.clear()
  flagState.clear()
}
Here is the problem:
The TimeCharacteristic is ProcessingTime, which is determined by the system clock of the running machine. Given the ProcessingTime property, the watermark will NOT change over time, which means onTimer will never be called, unless the TimeCharacteristic is changed to EventTime.
According to the Flink website:
An hourly processing time window will include all records that arrived at a specific operator between the times when the system clock indicated the full hour. For example, if an application begins running at 9:15am, the first hourly processing time window will include events processed between 9:15am and 10:00am, the next window will include events processed between 10:00am and 11:00am, and so on.
If the watermark doesn't change over time, will the window function be triggered? After all, the condition for a window to be triggered is that the watermark reaches the end time of the window.
I'm wondering whether the triggering condition in processing time depends on the watermark at all; even though the official website doesn't mention it, perhaps the window is triggered based on processing time alone.
I hope someone can spend a little time on this. Many thanks!
Let me try to clarify a few things:
Flink provides two kinds of timers: event time timers, and processing time timers. An event time timer is triggered by the arrival of a watermark equal to or greater than the timer's timestamp, and a processing time timer is triggered by the system clock reaching the timer's timestamp.
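For example, inside a KeyedProcessFunction (a two-line sketch; ts is an assumed timestamp in epoch milliseconds):

// event time timer: fires when a watermark with timestamp >= ts arrives
ctx.timerService().registerEventTimeTimer(ts);
// processing time timer: fires when the system clock reaches ts
ctx.timerService().registerProcessingTimeTimer(ts);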
Watermarks are only relevant when doing event time processing, and the only purpose they serve is to trigger event time timers. They play no role at all in applications like the one in this DataStream API Code Walkthrough that you have referred to. If this application used event time timers, either directly or indirectly (by using event time windows, or through one of the higher-level APIs like SQL or CEP), then it would need watermarks. But since it only uses processing time timers, it has no use for watermarks.
BTW, this fraud detection example isn't using Flink's Window API, because Flink's windowing mechanism isn't a good fit for this application's requirements. Here we are trying to match a pattern to a sequence of events within a specific timeframe -- so we want a different kind of "window" that begins at the moment of a special triggering event (a small transaction, in this case), rather than a TimeWindow (like those provided by Flink's Window API) that is aligned to the clock (e.g., 10:00am to 10:01am).

Apache Flink - How to Combine AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks?

Use case: using event time, with timestamps extracted from Kafka records.
myConsumer.assignTimestampsAndWatermarks(new MyTimestampEmitter());
...
stream
    .keyBy("platform")
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(AggFunc(), WindowFunc())
    .countWindowAll(size)
    .apply(someFunc)
    .addSink(someSink);
What I want: Flink extracts the timestamp and emits a watermark for each record during an initial interval (e.g. 20 seconds); after that, it emits watermarks periodically (e.g. every 10s).
Reason: with a periodic watermark, at the beginning Flink emits watermarks only after some interval, and the count in my first 5-minute window is wrong -- much larger than the count in the subsequent windows. I have a workaround of setting setAutoWatermarkInterval to 100ms, but this is more than necessary.
Currently I must use either AssignerWithPeriodicWatermarks or AssignerWithPunctuatedWatermarks. How can I implement this kind of combined strategy? Thanks.
Before doing something unusual with your watermark generator, I would double-check that you've correctly diagnosed the situation. By and large, event-time windows should behave deterministically, and always produce the same results if presented with the same input. If you are getting results for the first window that vary depending on how often watermarks are being produced, that indicates that you probably have late events that are being dropped when the watermarks arrive more frequently, and are able to be included when the watermarks are less frequent. Perhaps your watermarks aren't correctly accounting for the actual degree of out-of-orderness your events are experiencing? Or perhaps your watermarks are based on System.currentTimeMillis(), rather than the event timestamps?
Also, it's normal for the first time window to be different than the others, because time windows are aligned to the epoch, rather than the first event. Of course, this has the effect that the first window covers a shorter period of time than all of the others, so you should expect it to contain fewer events, not more.
Setting setAutoWatermarkInterval to 100ms is a perfectly normal thing to do. But if you really want to avoid this, you might consider an AssignerWithPunctuatedWatermarks that initially returns a watermark for every event, and then after a suitable interval, returns watermarks less often.
In a punctuated watermark assigner, both the extractTimestamp and checkAndGetNextWatermark methods are called for every event. You can use some transient (non-flink) state in the assigner to keep track of either the time of the first event, or to count events, and use that information in checkAndGetNextWatermark to eventually back off and stop producing watermarks for every event (by sometimes returning null from checkAndGetNextWatermark, rather than a Watermark). Your application will always revert back to generating watermarks for every event whenever it is restarted.
This will not yield an assigner with all of the characteristics of both periodic and punctuated assigners; it's simply an adaptive punctuated assigner.
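A minimal sketch of such an adaptive punctuated assigner, using the legacy API discussed above (MyEvent, getTimestamp(), and the warm-up threshold are assumptions):

import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class AdaptivePunctuatedAssigner implements AssignerWithPunctuatedWatermarks<MyEvent> {

    private static final long MAX_OUT_OF_ORDERNESS_MS = 10_000; // assumed bound
    private static final long WARMUP_COUNT = 1_000;             // assumed warm-up

    // transient (non-Flink) state: resets to 0 whenever the job restarts,
    // so the assigner reverts to per-event watermarks after a restart
    private transient long eventsSeen;

    @Override
    public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
        return element.getTimestamp();
    }

    @Override
    public Watermark checkAndGetNextWatermark(MyEvent lastElement, long extractedTimestamp) {
        eventsSeen++;
        if (eventsSeen <= WARMUP_COUNT || eventsSeen % 100 == 0) {
            // during warm-up emit a watermark for every event;
            // afterwards, back off to every 100th event
            return new Watermark(extractedTimestamp - MAX_OUT_OF_ORDERNESS_MS);
        }
        return null; // returning null suppresses the watermark
    }
}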

Accessing Flink's system metrics in code and printing them to the terminal, rather than using a metrics reporter like JMX

I have used JMX as a metric reporter to get the Flink metrics, but is there any way to get them as output in the terminal?
I want to plot numRecordsInPerSecond for each operator for performance analysis; how can I do it?
I have seen some examples of accumulators, but they did not give me proper insight into how to do performance analysis of Flink. I will give you an example here.
This is the execution plan of my Flink program. I have multiple questions, but I want to ask the basic ones:
How can I measure the latency of each operator and then add them up to compute the total latency for a complex event?
How do I measure output throughput? Currently I have written some code in the select function which counts the number of complex events seen and the time the Flink engine has been up. Is this the best way to do it?
But the basic question remains: how can I get the system metrics mentioned at Flink metrics via code, shown in terminal output, so that I can plot performance graphs? The problem with JMX is that it shows metrics on demand; I only see a value when I click that particular metric in the JMX console, which is not a good fit for analyzing the system.
P.S. I found a question on StackOverflow about computing throughput and latency, and the answer was something like this:
private static class MyMapper extends RichMapFunction<String, Object> {
    private transient Meter meter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.meter = getRuntimeContext()
                .getMetricGroup()
                .meter("myMeter", new DropwizardMeterWrapper(new com.codahale.metrics.Meter()));
    }

    @Override
    public Object map(String value) throws Exception {
        this.meter.markEvent();
        return value;
    }
}
I have added the above class to my code as well but haven't seen any output. I also wonder how this code would show throughput or latency, since we haven't specified which operator we want to measure. For example, I want to find the throughput of an operator somewhere in the middle of the execution plan, rather than at the end of the plan; will the above code do that for me?
You already have the latency and the number of records per second in/out for each component listed on the Flink Dashboard; there is no need to implement an extra custom counter or metric to calculate the records per second in/out for each component.
And if you do want to implement your own counter/meter, then you need code like the above, and you have to attach it to whichever component you are targeting, as sketched below.
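For example, a hedged sketch of placing the meter mid-pipeline (ParseFunction and the surrounding streams are assumptions):

DataStream<String> parsed = source.flatMap(new ParseFunction());

// insert the metered (identity-like) map at the point you want to measure,
// e.g. in the middle of the plan rather than just before the sink
DataStream<Object> metered = parsed.map(new MyMapper());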
To get the metric output in the terminal, you can use the SLF4J metric reporter.
Put this in your flink-conf.yaml file
metrics.reporter.slf4j.factory.class: org.apache.flink.metrics.slf4j.Slf4jReporterFactory
metrics.reporter.slf4j.interval: 1 SECONDS
And you can monitor the log using the terminal, for example:
tail -f taskmanager-log.log
Source: https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/metric_reporters/#slf4j
