I have a streaming job which listens to events and does operations on them using CEP.
The flow is:

stream = source
    .assignTimestampsAndWatermarks(...)
    .filter(...);

CEP.pattern(stream.keyBy(e -> e.getId()), pattern)
    .process(new PatternMatchProcessFunction())
    .addSink(...);
The keys are all short-lived, and the process function doesn't contain any state, so the state could be removed by setting a TTL. I am using the EventTime characteristic.
My question: how does Flink handle the expired keys, and does this have any impact on GC?
If Flink removes the keys itself, at what frequency does this happen?
I am facing GC issues; the job gets stuck about 3 hours after deployment.
I am doing memory tuning, but I want to rule this case out.
The FsStateBackend will hold the state in memory for your CEP operator.
What Flink does for CEP is buffer the elements in a MapState[Long, List[T]], which maps a timestamp to all elements that arrived at that time. Once a watermark arrives, Flink processes the buffered events as follows:
// 1) get the queue of pending elements for the key and the corresponding NFA,
// 2) process the pending elements in event-time order (and by a custom comparator, if one exists) by feeding them into the NFA,
// 3) advance the time to the current watermark, so that expired patterns are discarded,
// 4) update the stored state for the key, storing the new NFA and MapState only if they have state to be used later,
// 5) update the last seen watermark.
Once the events have been processed, Flink will advance the watermark, which will cause old entries in the state to be expired (you can see this inside NFA.advanceTime). This means that eviction of elements in your case depends on how often watermarks are created and pushed through your stream.
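Since cleanup is driven by watermark progress, the watermark strategy itself is a practical lever. A minimal sketch, assuming an Event type with a getTimestamp() accessor (both illustrative): a small out-of-orderness bound keeps watermarks, and therefore the cleanup in NFA.advanceTime, close behind the events.

DataStream<Event> stream = source
        .assignTimestampsAndWatermarks(
                WatermarkStrategy
                        .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                        .withTimestampAssigner((e, ts) -> e.getTimestamp())) // illustrative accessor
        .filter(e -> e.isRelevant()); // illustrative filter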
Related
I am trying to figure out a solution to the problem of watermark progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example a Flink app with parallelism of 3 that needs to read data from 5 Kafka partitions. My issue is that when the Flink app starts, it has to consume historical data from these partitions. As I understand it, each Flink task starts consuming events from a corresponding partition (probably buffering a significant number of events) and advances event time (and therefore watermarks) before the same task transitions to another partition, whose data is now stale according to the watermarks already issued.
I tried a watermark strategy using watermark alignment of a few seconds, but that does not solve the problem, since the historical data are consumed immediately from one partition and therefore the event time/watermark has already progressed. Below is a snippet of the watermark strategy implemented:
WatermarkStrategy.forGenerator(ws)
    .withTimestampAssigner(
        (event, timestamp) -> (long) event.get("event_time"))
    .withIdleness(IDLENESS_PERIOD)
    .withWatermarkAlignment(
        GROUP,
        Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
        Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events, as described in Sorting union of streams to identify user sessions in Apache Flink, but again this cannot effectively tackle my issue, since event record times can deviate significantly.
How can I tackle this issue? Do I need the same number of Flink tasks as Kafka partitions, or am I missing something about the way data is read from Kafka partitions?
The easiest solution to this problem is to pass the WatermarkStrategy to fromSource instead of assigning it with assignTimestampsAndWatermarks.
When you use the WatermarkStrategy directly in fromSource with the Kafka connector, the watermarks will be partition-aware, so the watermark generated by the given operator will be the minimum over all partitions assigned to this operator.
Assigning watermarks directly in the source will solve the problem you are facing, but it has one main drawback: since the generated watermark is the minimum over all partitions processed by the given operator, if some partition is idle, the watermark for this operator will not progress either.
The docs describe kafka connector watermarking here.
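A minimal sketch of this approach, assuming a KafkaSource that produces an Event type with a getEventTime() accessor (the deserializer, topic, and durations are illustrative):

KafkaSource<Event> source = KafkaSource.<Event>builder()
        .setBootstrapServers("broker:9092")            // illustrative address
        .setTopics("events")                           // illustrative topic
        .setValueOnlyDeserializer(new EventDeserializationSchema()) // illustrative schema
        .build();

DataStream<Event> stream = env.fromSource(
        source,
        WatermarkStrategy
                .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                .withTimestampAssigner((event, ts) -> event.getEventTime())
                .withIdleness(Duration.ofMinutes(1)),  // guards against the idle-partition drawback
        "kafka-source");

Because the watermark is computed per split inside the source, a task reading several partitions emits the minimum of its partitions' watermarks instead of letting one partition race ahead.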
I'm using Flink with a Kinesis source and event-time keyed windows. The application listens to a live stream of data, windowing (event-time windows) and processing each keyed stream. I have another use case where I also need to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem, since Flink doesn't support per-key watermarks. Any keyed stream being backfilled will end up being ignored, since the event time for this stream will be less than the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to find a workable approach.
Here are possible approaches I'm considering but still have some open questions.
Possible Approach - 1
(i) Maintain a copy of the application specifically for backfill purposes. The backfill job will happen rarely (~a few times a month). The stream of data sent to the application copy will have an indicator for start and stop in the stream. Using that, I plan on starting/resetting the watermark.
Open question: is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow-up to: Clear Flink watermark state in DataStream [no definitive solution was provided].
Possible Approach - 2
Have a parallel instance for each key, since it's possible to have a different watermark per task. Not going with this, since I'll have > 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite), and known in advance. This allows Flink to sort the input by key and by timestamp, and the processing will proceed correctly according to event time without any concern for watermarks or late events.
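A minimal sketch of switching the backfill job into batch execution; the rest of the pipeline stays unchanged:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// In BATCH mode the bounded input is sorted by key (and by time where needed),
// so watermarks and late events are no longer a concern for the backfill.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);

The same can be set without code changes via the execution.runtime-mode configuration option, which keeps one codebase for both the live and backfill deployments.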
I see that there are a lot of discussions going on about adding support for per-key watermarks. But does Flink support per-partition watermarks?
Currently the minimum of all the watermarks (across non-idle partitions) is taken into account. Because of this, the last records hanging in a window are stuck as well (even when incrementing the watermark using periodicEmit).
Any info on this is really appreciated!
Some of the sources, such as the FlinkKafkaConsumer, support per-partition watermarking. You get this by calling assignTimestampsAndWatermarks on the source, rather than on the stream produced by the source.
What this does is that each consumer instance tracks the maximum timestamp within each partition, and takes as its watermark the minimum of these maximums, less the configured bounded out-of-orderness. Idle partitions will be ignored, if you configure it to do so.
Not only does this yield more accurate watermarking, but if your events are in-order within each partition, this also makes it possible to take advantage of the WatermarkStrategy.forMonotonousTimestamps() strategy.
See Watermark Strategies and the Kafka Connector for more details.
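A minimal sketch of per-partition watermarking; Event, its getEventTime() accessor, the deserialization schema, and the idleness duration are illustrative:

FlinkKafkaConsumer<Event> consumer =
        new FlinkKafkaConsumer<>("topic", new EventDeserializationSchema(), properties);

// Calling this on the consumer, not on the resulting stream, enables
// per-partition watermarking inside the source.
consumer.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .<Event>forMonotonousTimestamps()        // valid if events are in order per partition
                .withTimestampAssigner((e, ts) -> e.getEventTime())
                .withIdleness(Duration.ofMinutes(1)));   // ignore idle partitions

DataStream<Event> stream = env.addSource(consumer);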
As for why the last window isn't being triggered, this is related to watermarking, but not to per-partition watermarking. The problem is simply that windows are triggered by watermarks, and the watermarks are trailing behind the timestamps in the events. So the watermarks can never catch up to the final events, and can never trigger the last window.
This isn't a problem for unbounded streaming jobs, since they never stop and never have a last window. And it isn't a problem for batch jobs, since they are aware of all of the data. But for bounded streaming jobs, you need to do something to work around this issue. Broadly speaking, what you must do is to inform Flink that the input stream has ended -- whenever the Flink sources detect that they have reached the end of an event-time-based input stream, they emit one last watermark whose value is MAX_WATERMARK, and this will trigger any open windows.
One way to do this is to use a KafkaDeserializationSchema with an implementation of isEndOfStream that returns true when the job reaches its end.
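A hypothetical sketch of such a schema; the end-marker check and the Event helpers are placeholders for whatever end-of-input signal your job can detect:

public class BoundedEventSchema implements KafkaDeserializationSchema<Event> {

    @Override
    public Event deserialize(ConsumerRecord<byte[], byte[]> record) throws Exception {
        return Event.fromBytes(record.value()); // illustrative decoder
    }

    @Override
    public boolean isEndOfStream(Event nextElement) {
        // Returning true shuts the source down; the runtime then emits the final
        // MAX_WATERMARK that fires any remaining windows.
        return nextElement.isEndMarker(); // illustrative sentinel check
    }

    @Override
    public TypeInformation<Event> getProducedType() {
        return TypeInformation.of(Event.class);
    }
}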
In Apache Flink, setAutoWatermarkInterval(interval) controls how often watermarks are produced for downstream operators so that they can advance their event time.
If the watermark has not changed during the specified interval (because no events arrived), will the runtime emit any watermarks? On the other hand, if a new event arrives before the next interval, will a new watermark be emitted immediately, or will it be queued until the next interval is reached?
I am curious what the best AutoWatermarkInterval configuration is (especially for high-rate sources): the smaller this value, the smaller the lag between processing time and event time, but at the overhead of more bandwidth used to send the watermarks. Is that accurate?
On the other hand, if I use env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime), the Flink runtime will automatically assign timestamps and watermarks (timestamps correspond to the time the event entered the Flink dataflow pipeline, i.e., the source operator). Nevertheless, even with ingestion time we can still define a processing-time timer (in the processElement function), as shown below:

long timer = context.timestamp() + Timeout;
context.timerService().registerProcessingTimeTimer(timer);

where context.timestamp() is the ingestion time set by Flink.
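A fuller sketch of that registration inside a KeyedProcessFunction; Event, the String key type, and the 60-second Timeout value are illustrative:

public class TimeoutFunction extends KeyedProcessFunction<String, Event, Event> {

    private static final long Timeout = 60_000L; // 60 seconds, illustrative

    @Override
    public void processElement(Event event, Context context, Collector<Event> out) {
        // context.timestamp() is the ingestion time assigned by Flink
        long timer = context.timestamp() + Timeout;
        context.timerService().registerProcessingTimeTimer(timer);
        out.collect(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out) {
        // Fires at the given processing time, regardless of the time characteristic.
    }
}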
Thank you.
The autoWatermarkInterval only affects watermark generators that pay attention to it. They also have an opportunity to generate a watermark in combination with event processing.
For those watermark generators that use the autoWatermarkInterval (which is definitely the normal case), they are collecting evidence for what the next watermark should be as a side effect of assigning timestamps for each event. When a timer fires (based on the autoWatermarkInterval), the watermark generator is then asked by the Flink runtime to produce the next watermark. The watermark wasn't waiting somewhere, nor was it queued, but rather it is created on demand, based on information that had been stored by the timestamp assigner -- which is typically the maximum timestamp seen so far in the stream.
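For illustration, a minimal periodic WatermarkGenerator along these lines (the Event type and the 3-second bound are illustrative): onEvent only records evidence, and the watermark is created on demand when the runtime calls onPeriodicEmit every autoWatermarkInterval.

public class BoundedOutOfOrderGenerator implements WatermarkGenerator<Event> {

    private static final long MAX_OUT_OF_ORDERNESS = 3000; // 3 seconds, illustrative
    private long maxTimestamp = Long.MIN_VALUE + MAX_OUT_OF_ORDERNESS + 1;

    @Override
    public void onEvent(Event event, long eventTimestamp, WatermarkOutput output) {
        // Collect evidence only; no watermark is emitted here.
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // Called by the runtime every autoWatermarkInterval; the watermark is
        // produced on demand from the evidence gathered in onEvent.
        output.emitWatermark(new Watermark(maxTimestamp - MAX_OUT_OF_ORDERNESS - 1));
    }
}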
Yes, more frequent watermarks mean more overhead to communicate and process them, but lower latency. You have to decide how to handle this throughput/latency tradeoff based on your application's requirements.
You can always use processing time timers, regardless of the TimeCharacteristic. (By the way, at a low level, the only thing watermarks do is to trigger event time timers, be they in process functions, windows, etc.)
My Flink job reads from a Kafka consumer using FlinkKafkaConsumer010 and sinks into HDFS using a CustomBucketingSink. We have a series of transformations: kafka -> flatMaps (2-3 transformations) -> keyBy -> tumblingWindow (5 mins) -> aggregation -> hdfsSink. We have Kafka input of 3 million events/min on average and around 20 million events/min at peak. The checkpoint interval and the minimum pause between two checkpoints are both 3 minutes, and I am using the FsStateBackend.
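For reference, a sketch of the checkpoint settings described above (the state backend path is illustrative):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints")); // illustrative path
env.enableCheckpointing(3 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE); // 3-minute interval
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(3 * 60 * 1000L); // 3-minute minimum pause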
Here are my assumptions:
Flink consumes some fixed number of events from Kafka (multiple offsets from multiple partitions at once), waits until they reach the sink, and then checkpoints. In case of success it commits the Kafka partition offsets it read and maintains some state related to the HDFS file it was writing. While multiple transformations happen after Kafka hands events over to other operators, the Kafka consumer sits idle until it gets confirmation of success for the events that it sent. So we can say that while the sink is writing data to HDFS, all previous operators are sitting idle. In case of failure, Flink goes back to the previous checkpoint state, points to the last committed Kafka partition offset, and points to the HDFS file offset it should start writing to.
Here are my doubts based on above assumptions:
1) Are the above assumptions correct?
2) Does it make sense for the tumbling window to have state, given that in case of failure we start anyway from the last committed Kafka partition offset?
3) If the tumbling window keeps state, when will this state be used by Flink?
4) Why do checkpoint and savepoint state sizes vary?
5) In case of any failure, does Flink always restart from the source operator?
Your assumptions are not correct.
(1) Checkpointing does not depend in any way on events or results reaching the sink(s).
(2) Flink does its own Kafka offset management. When restoring from a checkpoint, after a failure, the offsets in the checkpoint are used, not those that may have been committed back to Kafka.
(3) No operators are ever idle in the way you've described. The pipeline is not stalled by checkpointing.
The best way to understand how checkpointing works is to go through the Flink operations playground, especially the section on Observing Failure and Recovery. This will give you a much clearer understanding of this topic, because you'll be able to observe exactly what's happening.
I can also recommend reading https://ci.apache.org/projects/flink/flink-docs-master/training/fault_tolerance.html, and following the links contained there.
But to walk through how checkpointing works in your application, here are the basic steps:
(1) When the checkpoint coordinator (part of the job manager) decides it's time to initiate another checkpoint, it informs each of the task managers to start checkpoint n.
(2) All of the source instances checkpoint their own state, and insert checkpoint barrier n into their outgoing streams. In your case, the sources are Kafka consumers, and they checkpoint the current offset for each partition.
(3) Whenever the checkpoint barrier reaches the head of the input queue in a stateful operator, that operator checkpoints its state and forwards the barrier. This part has some complexity to it -- but basically, the state is held in a multi-version, concurrency-controlled hash map. The operator creates a new version n+1 of the state that can be modified by the events behind the checkpoint barrier, and creates a new thread to asynchronously snapshot all the state in version n.
In your case, the window and sink are stateful. The window's state includes the current window contents, the state of the trigger, and other state you're using for window processing, if any.
(4) Sinks use the arrival of the barrier to flush any queued output, and commit pending transactions. Again, there's some complexity here, as transactional sinks use a two-phase commit protocol.
In your application, if the checkpoint interval is much smaller than the window duration, then the sink will complete many checkpoints before ever receiving any output from the window.
(5) When the checkpoint coordinator has heard back from every task that the checkpoint is complete, it finalizes the checkpoint metadata.
During recovery, the state of every operator is reset to the state in the most recent checkpoint. This means that the sources are rewound to the offsets in the checkpoint, and processing resumes with the state in the window and sink corresponding to what it should be after having consumed the events up to those offsets.
Note: To keep this reasonably simple, I've glossed over a bunch of details. Also, FLIP-76 will introduce a new approach to checkpointing.