Flink window operator checkpointing - apache-flink

I want to know how Flink checkpoints the window operator, and how exactly-once semantics are guaranteed on recovery. For example, does it save the tuples in the current window and the progress of the current window's processing? I would like to understand the detailed process of the window operator's checkpointing and recovery.

All of Flink's stateful operators participate in the same checkpointing mechanism. When instructed to do so by the checkpoint coordinator (part of the job manager), the task managers initiate a checkpoint in each parallel instance of every source operator. The sources checkpoint their offsets and insert a checkpoint barrier into the stream. This divides the stream into the parts before and after the checkpoint. The barriers flow through the graph, and each stateful operator checkpoints its state upon having processed the stream up to the checkpoint barrier. The details are described at the link shared by @bupt_ljy.
Thus these checkpoints capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
Given that during recovery the sources are rewound and replayed, "exactly once" means that the state managed by Flink is affected exactly once, not that the stream elements are processed exactly once.
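For reference, here's a minimal sketch of how a job opts into this mechanism (the interval shown is a placeholder, not a recommendation):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointConfigExample {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Ask the checkpoint coordinator to trigger a checkpoint every 10 seconds,
            // with barrier alignment so each operator snapshots its state only after
            // processing the stream up to the barrier on every input.
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);
        }
    }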
There's nothing particularly special about windows in this regard. Depending on the type of window function being applied, a window's contents are kept in an element of managed ListState, ReducingState, AggregatingState, or FoldingState. As stream elements arrive and are assigned to a window, they are appended, reduced, aggregated, or folded into that state. Other components of the window API, including Triggers and ProcessWindowFunctions, can have state that is checkpointed as well. For example, a CountTrigger uses ReducingState to keep track of how many elements have been assigned to the window, adding one to the count as each element arrives.
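As a simplified sketch of that pattern (modeled on, but not identical to, Flink's built-in CountTrigger), note how the counter lives in ReducingState obtained from the TriggerContext, so it is checkpointed and restored along with the window contents:

    import org.apache.flink.api.common.state.ReducingState;
    import org.apache.flink.api.common.state.ReducingStateDescriptor;
    import org.apache.flink.streaming.api.windowing.triggers.Trigger;
    import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;

    public class CountTriggerSketch extends Trigger<Object, TimeWindow> {

        private final long maxCount;

        // The count is kept in managed ReducingState, so Flink checkpoints it.
        private final ReducingStateDescriptor<Long> countDesc =
                new ReducingStateDescriptor<>("count", (a, b) -> a + b, Long.class);

        public CountTriggerSketch(long maxCount) {
            this.maxCount = maxCount;
        }

        @Override
        public TriggerResult onElement(Object element, long timestamp,
                                       TimeWindow window, TriggerContext ctx) throws Exception {
            ReducingState<Long> count = ctx.getPartitionedState(countDesc);
            count.add(1L); // one more element assigned to this window
            if (count.get() >= maxCount) {
                count.clear();
                return TriggerResult.FIRE;
            }
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.CONTINUE;
        }

        @Override
        public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) {
            return TriggerResult.CONTINUE;
        }

        @Override
        public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
            ctx.getPartitionedState(countDesc).clear();
        }
    }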
In the case where the window function is a ProcessWindowFunction, all of the elements assigned to the window are saved in Flink state and passed in an Iterable to the ProcessWindowFunction when the window is triggered. That function iterates over the contents and produces a result. The internal state of the ProcessWindowFunction is not checkpointed; if the job fails during the execution of the ProcessWindowFunction, the job resumes from the most recently completed checkpoint. This involves rewinding back to a time before the window received the event that triggered the firing (that event can't be included in the checkpoint, because a checkpoint barrier that followed it can't yet have had its effects captured). Sooner or later the window will again reach the point where it is triggered, and the ProcessWindowFunction will be called again -- with the same window contents it received the first time -- and hopefully this time it won't fail. (Note that I've ignored the case of processing-time windows, which do not behave deterministically.)
When a ProcessWindowFunction uses managed/checkpointed state, that state is there to remember things between firings, not within a single firing. For example, a window that allows late events might want to store the previously reported result and then issue an update for each late event.
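Here's a minimal sketch of that idea, assuming a hypothetical Event input type: a ProcessWindowFunction that uses per-window state to remember the previously reported count, so each late firing is emitted as an update rather than a fresh result:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.util.Collector;

    public class UpdatingCountFunction
            extends ProcessWindowFunction<Event, String, String, TimeWindow> {

        private final ValueStateDescriptor<Long> lastResultDesc =
                new ValueStateDescriptor<>("lastResult", Long.class);

        @Override
        public void process(String key, Context ctx, Iterable<Event> elements,
                            Collector<String> out) throws Exception {
            long count = 0;
            for (Event ignored : elements) {
                count++;
            }
            // windowState() is scoped to this key *and* this window, and survives
            // between the initial firing and any late firings.
            ValueState<Long> lastResult = ctx.windowState().getState(lastResultDesc);
            Long previous = lastResult.value();
            if (previous == null) {
                out.collect(key + ": " + count);
            } else {
                out.collect(key + ": " + count + " (update; was " + previous + ")");
            }
            lastResult.update(count);
        }

        @Override
        public void clear(Context ctx) throws Exception {
            // Per-window state must be cleaned up when the window is finally purged.
            ctx.windowState().getState(lastResultDesc).clear();
        }
    }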

Related

Flink: What happens when ProcessAllWindowFunction takes more time than the TumblingProcessingTimeWindows defined in windowAll()

I have a ProcessAllWindowFunction implementation (see AttributeBackLogEvents() in the code below) that does quite a bit of I/O and might take more than 30 seconds. windowAll() is windowing the data using TumblingProcessingTimeWindows of 30 seconds.
    attributedStream
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
        .process(new AttributeBackLogEvents())
        .forceNonParallel()
        .addSink(ConfluentKafkaSink.createKafkaSinkFromApplicationProperties())
        .name("Enriched Event kafka topic sink");
AttributeBackLogEvents fetches a set of events from MySQL based on the iterable passed in and, after some processing, deletes some of the fetched events from MySQL. I'm seeing that the records the current window is fetching (which ideally should be deleted before the next window fires) are also being fetched by the next window, meaning the next window fires even though the current window is still processing.
My questions are:
Is it possible that AttributeBackLogEvents is still running when the next window fires?
If so, how can I enforce that the next window doesn't fire until the current window's processing is complete?
This answer doesn't address what happens in your specific logic, but conceptually:
Your window defines a time range of the source data, so there is no guarantee that processing of one window finishes before the next window starts.
There may be a way, but for a streaming tool a source like MySQL is typically treated as reference data (which you typically want to read often), unless you are doing Change Data Capture.

How to know when the keyed window processing starts and when it has finished

I have a keyed window stream processing application (KeyStream.window.process), and the window is a 15-minute tumbling window.
I would like to know when a new window's processing starts and when it ends, so that I can use that chance to do some cleanup/initialization work globally.
For each window, before the processing kicks off, I would like to do some initialization work, such as truncating a DB table (this operation should only occur in one place; it is a global operation that should not be done in the process method).
And when the processing window ends (all the process operator's tasks have finished), I would like to do some other cleanup work (again, a global operation).
I would like to know whether this is possible in Flink and how to do it, thanks!
I think you could accomplish this in an operator that follows the window, running with a parallelism of one. This operator will need to detect when a new batch of results begins to arrive from the window, and can do what's needed to close the previous window in the DB and initialize the new one at that time. It can also implement close() to do whatever wrap-up is needed if/when the job is ending or being shutdown.
Having done the initialization, this operator can simply forward on all of the events it receives from the window operator, until detecting the beginning of the next window's results.
This operator will need to keep one piece of managed state, namely some sort of identifier for the current window, so it can detect when a new window has begun. The results from the window will need to carry this identifier -- which could just be the window starting or ending timestamp.
You can use Flink's key-partitioned state for this -- simply key the stream by a constant. This is normally a bad idea, because it forces the effective parallelism to one (every event is assigned the same key), but since that's needed anyway by this (global) operator, it's not an issue.
Given these requirements, this operator could be a RichFlatMapFunction, or a KeyedProcessFunction. You'll need to use a KeyedProcessFunction if you find yourself wanting to use timers to do cleanup.
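Here's a minimal sketch of that operator, assuming a hypothetical WindowResult type that carries the end timestamp of the window that produced it:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    public class WindowBoundaryDetector
            extends KeyedProcessFunction<Integer, WindowResult, WindowResult> {

        private transient ValueState<Long> currentWindowEnd;

        @Override
        public void open(Configuration parameters) {
            currentWindowEnd = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("currentWindowEnd", Long.class));
        }

        @Override
        public void processElement(WindowResult result, Context ctx,
                                   Collector<WindowResult> out) throws Exception {
            Long known = currentWindowEnd.value();
            if (known == null || known != result.windowEnd) {
                if (known != null) {
                    finishWindow(known); // global cleanup for the previous window
                }
                initializeWindow(result.windowEnd); // e.g., truncate the DB table
                currentWindowEnd.update(result.windowEnd);
            }
            out.collect(result); // forward every result unchanged
        }

        private void initializeWindow(long windowEnd) { /* global setup */ }

        private void finishWindow(long windowEnd) { /* global cleanup */ }
    }

Keying by a constant -- e.g., results.keyBy(r -> 0).process(new WindowBoundaryDetector()) -- forces every window result through this single instance, which is exactly what this global operator needs.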

Flink window state size and state management

After reading Flink's documentation and searching around, I couldn't entirely understand how Flink handles state in its windows.
Let's say I have an hourly tumbling window with an aggregation function that accumulates msgs into some Java POJO or Scala case class.
Will the size of that window be tied to the number of events entering it in a single hour, or just to the POJO/case class, since I'm accumulating the events into that object? (E.g., if counting 10000 msgs into an integer, will the size be close to 10000 * msg size, or the size of an int?)
Also, if I'm using POJOs or case classes, does Flink handle the state for me (spilling to disk if memory is exhausted, saving state at checkpoints, etc.), or must I use Flink's state objects for that?
Thanks for your help!
The state size of a window depends on the type of function that you apply. If you apply a ReduceFunction or AggregateFunction, arriving data is immediately aggregated and the window only holds the aggregated value. If you apply a ProcessWindowFunction or WindowFunction, Flink collects all input records and applies the function when time (event or processing time depending on the window type) passes the window's end time.
You can also combine both types of functions, i.e., have an AggregateFunction followed by a ProcessWindowFunction. In that case, arriving records are immediately aggregated and when the window is closed, the aggregation result is passed as single value to the ProcessWindowFunction. This is useful because you have incremental aggregation (due to ReduceFunction / AggregateFunction) but also access to the window metadata like begin and end timestamp (due to ProcessWindowFunction).
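A minimal sketch of that combination, assuming a hypothetical Event type with a getKey() accessor; note that the window state here is a single Long per key and window, no matter how many events arrive:

    // Incremental aggregation plus window metadata in one call:
    DataStream<Tuple3<String, Long, Long>> counts = events
            .keyBy(e -> e.getKey())
            .window(TumblingEventTimeWindows.of(Time.hours(1)))
            .aggregate(new CountAggregate(), new AttachWindowEnd());

    // The accumulator is just a running count; the window never stores events.
    public static class CountAggregate implements AggregateFunction<Event, Long, Long> {
        @Override public Long createAccumulator() { return 0L; }
        @Override public Long add(Event value, Long acc) { return acc + 1; }
        @Override public Long getResult(Long acc) { return acc; }
        @Override public Long merge(Long a, Long b) { return a + b; }
    }

    // Called once when the window fires, with the single pre-aggregated value.
    public static class AttachWindowEnd
            extends ProcessWindowFunction<Long, Tuple3<String, Long, Long>, String, TimeWindow> {
        @Override
        public void process(String key, Context ctx, Iterable<Long> aggregated,
                            Collector<Tuple3<String, Long, Long>> out) {
            out.collect(Tuple3.of(key, ctx.window().getEnd(), aggregated.iterator().next()));
        }
    }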
How the state is managed depends on the chosen state backend. If you configure the FsStateBackend, all local state is kept on the heap of the TaskManager, and the JVM process is killed with an OutOfMemoryError if the state grows too large. If you configure the RocksDBStateBackend, state is spilled to disk. This comes with de/serialization costs for every state access, but gives much more storage for state.
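For illustration, the backend choice is made on the execution environment; a rough sketch (the checkpoint path is a placeholder, and the two calls are alternatives):

    import java.io.IOException;
    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class StateBackendConfig {
        public static void main(String[] args) throws IOException {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Heap-based backend: fast access, but all local state must fit in
            // TaskManager memory.
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // ...or the RocksDB backend: state spills to local disk; every access
            // pays de/serialization costs, but state can grow far beyond memory.
            env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
        }
    }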

Process elements after sinking to Destination

I am setting up a Flink pipeline that reads from Kafka and sinks to HDFS. I want to process the elements after the addSink() step, because I want to set up trigger files indicating that writing data (to the sink) for a certain partition/hour is complete. How can this be achieved? Currently I am using the BucketingSink.
    DataStream messageStream = env
        .addSource(flinkKafkaConsumer011);

    // some aggregations to convert messageStream to keyedStream
    keyedStream.addSink(sink);

    // How can I process elements after the sink?
The Flink APIs do not support extending the job graph beyond the sink(s). (You can, however, fork the stream and do additional processing in parallel with writing to the sink.)
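For example, a rough sketch of that fork (the second path's function and sink are hypothetical):

    // The same stream feeds both the sink and an independent processing path.
    keyedStream.addSink(sink);                     // path 1: write to HDFS

    keyedStream
            .process(new TrackPartitionProgress()) // path 2: hypothetical bookkeeping
            .addSink(triggerFileSink);             // hypothetical trigger-file sink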
With the Streaming File Sink you can observe the part files transition to the finished state when they complete. See the JavaDoc for more information.
State lives within a single operator -- only that operator (e.g., a ProcessFunction) can modify it. If you want to modify keyed value state after the sink has completed, there's no straightforward way to do that. One idea would be to add a processing-time timer to the ProcessFunction that holds the keyed state; the timer wakes up periodically, checks for newly finished part files, and modifies the state based on their existence. Or, if that's the wrong granularity, write a custom source that does something similar and stream or broadcast that information into the ProcessFunction (which would then have to be a CoProcessFunction or a KeyedBroadcastProcessFunction) so it can do the necessary state updates.

Flink: Watermarking with Late Elements

I am doing real-time streaming in Flink, with Kafka as the message queue. I am applying an EventTimeSlidingWindow of 120 sec. with a slide of 1 sec., and I am also inserting a watermark at each second of event time.
My concern is what happens if an element arrives late, after the watermark. In my case, Flink simply discards messages that arrive after their respective watermark. Is there any mechanism provided by Flink to handle such late messages, like maintaining a separate window? I have gone through the documentation, but it didn't become clear to me.
Apache Flink has a concept called allowed lateness for the windows to handle data that arrives after a watermark.
By default, late elements are dropped when the watermark is past the end of the window. However, Flink allows you to specify a maximum allowed lateness for window operators. Allowed lateness specifies by how much time elements can be late before they are dropped; its default value is 0. Elements that arrive after the watermark has passed the end of the window, but before it passes the end of the window plus the allowed lateness, are still added to the window. Depending on the trigger used, a late but not dropped element may cause the window to fire again. This is the case for the EventTimeTrigger.
In order to make this work, Flink keeps the state of windows until their allowed lateness expires. Once this happens, Flink removes the window and deletes its state.
Another option is a side output. In addition to the main stream that results from DataStream operations, you can produce any number of additional side output streams. The type of data in the side output streams does not have to match the type of data in the main stream, and the types of the different side outputs can also differ. This operation can be useful when you want to split a stream of data where you would normally have to replicate the stream and then filter out from each copy the data you don't want.
When using side outputs, you first need to define an OutputTag that will be used to identify the side output stream (see https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/side_output.html):
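A minimal sketch combining allowed lateness with a side output for elements that are too late even for that (the element type, window sizes, and window function are illustrative):

    // An anonymous subclass is used so the OutputTag keeps its type information.
    final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

    SingleOutputStreamOperator<Result> results = stream
            .keyBy(e -> e.getKey())
            .window(SlidingEventTimeWindows.of(Time.seconds(120), Time.seconds(1)))
            .allowedLateness(Time.minutes(1)) // keep window state one extra minute
            .sideOutputLateData(lateTag)      // elements later than that go here
            .process(new MyWindowFunction()); // hypothetical window function

    // The dropped late elements are available as a separate stream.
    DataStream<Event> lateStream = results.getSideOutput(lateTag);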
Note that allowed lateness can result in multiple outputs: the window fires once when the watermark passes the end of the window, and then fires again, with an updated aggregate, for each late element that arrives within the allowed lateness.
