After reading Flink's documentation and searching around, I still don't fully understand how Flink handles state in its windows.
Let's say I have an hourly tumbling window with an aggregation function that accumulates messages into a Java POJO or Scala case class.
Will the size of that window's state be tied to the number of events entering the window in a single hour, or only to the size of the POJO/case class, since I'm accumulating the events into that object? (E.g., if I count 10000 messages into an integer, will the size be close to 10000 * message size or to the size of an int?)
Also, if I'm using POJOs or case classes, does Flink handle the state for me (spilling to disk if memory is exhausted, saving state at checkpoints, etc.), or must I use Flink's state objects for that?
Thanks for your help!
The state size of a window depends on the type of function that you apply. If you apply a ReduceFunction or AggregateFunction, arriving data is immediately aggregated and the window only holds the aggregated value. If you apply a ProcessWindowFunction or WindowFunction, Flink collects all input records and applies the function when time (event or processing time depending on the window type) passes the window's end time.
You can also combine both types of functions, i.e., have an AggregateFunction followed by a ProcessWindowFunction. In that case, arriving records are immediately aggregated, and when the window is closed, the aggregation result is passed as a single value to the ProcessWindowFunction. This is useful because you get incremental aggregation (due to the ReduceFunction / AggregateFunction) but also access to the window metadata, such as the begin and end timestamps (due to the ProcessWindowFunction).
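To make the size difference concrete, here is a minimal plain-Java sketch of the two behaviours. It is a model only, not Flink code: the class and method names are made up for illustration, and it simply contrasts folding each event into one accumulator with buffering every event until the window fires.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the two ways a window can hold its contents.
// Not Flink code: all names here are hypothetical.
class WindowStateSizeSketch {

    // Models reduce()/aggregate(): each event is folded into one accumulator,
    // so the per-window state is a single long regardless of the event count.
    static long incrementalCount(int events) {
        long count = 0L;
        for (int i = 0; i < events; i++) {
            count += 1;
        }
        return count;
    }

    // Models process()/apply() without pre-aggregation: every event is kept
    // until the window fires, so the state grows with the number of events.
    static List<String> bufferAll(int events) {
        List<String> buffer = new ArrayList<>();
        for (int i = 0; i < events; i++) {
            buffer.add("msg-" + i);
        }
        return buffer;
    }

    public static void main(String[] args) {
        int events = 10_000;
        System.out.println("incremental state holds 1 value, count = " + incrementalCount(events));
        System.out.println("buffered state holds " + bufferAll(events).size() + " elements");
    }
}
```

For the question's example of counting 10000 messages, the first variant corresponds to state the size of one number per key and window, while the second corresponds to roughly 10000 times the message size.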
How the state is managed depends on the chosen state backend. If you configure the FsStateBackend, all local state is kept on the heap of the TaskManager, and the JVM process is killed with an OutOfMemoryError if the state grows too large. If you configure the RocksDBStateBackend, state is spilled to disk. This comes with de/serialization costs for every state access but gives much more storage for state.
We are aggregating some data for 1 minute which we then flush onto a file. The data itself is like a map where key is an object and value is also an object.
Since we need to flush all the data together, we are not doing any keyBy and are using windowAll instead.
The problem is that we get better throughput when we use the window with a ProcessAllWindowFunction and aggregate inside the process call than when we use aggregate on the window. We are also seeing state checkpointing timeouts when we use aggregate.
I went through the code base, and the only hypothesis I could come up with is that the ListState used by process is easier to checkpoint than the AggregatingState used by aggregate.
Is the hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance on aggregate?
Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend and are aggregating each incoming event into some sort of collection. In that case, the RocksDB state backend has to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.
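The cost difference described above can be sketched with a small plain-Java model. This is not Flink or RocksDB code, and the byte encoding and all names are assumptions made for illustration; it only counts how many bytes get written when a serialized collection is rewritten for every event versus when each new element is appended.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Plain-Java cost model, not Flink or RocksDB code; all names are hypothetical.
class SerializationCostSketch {

    // Encodes a list as a 4-byte length followed by 4 bytes per element.
    static byte[] serialize(List<Integer> values) {
        ByteBuffer buffer = ByteBuffer.allocate(4 + 4 * values.size());
        buffer.putInt(values.size());
        for (int value : values) {
            buffer.putInt(value);
        }
        return buffer.array();
    }

    static List<Integer> deserialize(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int size = buffer.getInt();
        List<Integer> values = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            values.add(buffer.getInt());
        }
        return values;
    }

    // Collection-valued accumulator kept as serialized bytes: every event
    // reads the whole collection, adds one element, and writes it all back.
    static long bytesWrittenWithRewrite(int events) {
        byte[] stored = serialize(new ArrayList<>());
        long written = 0;
        for (int event = 0; event < events; event++) {
            List<Integer> accumulator = deserialize(stored);
            accumulator.add(event);
            stored = serialize(accumulator);
            written += stored.length;
        }
        return written;
    }

    // Append-only list: only the new element is serialized and appended.
    static long bytesWrittenWithAppend(int events) {
        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        long written = 0;
        for (int event = 0; event < events; event++) {
            byte[] encoded = ByteBuffer.allocate(4).putInt(event).array();
            stored.write(encoded, 0, encoded.length);
            written += encoded.length;
        }
        return written;
    }

    public static void main(String[] args) {
        int events = 1_000;
        System.out.println("rewrite: " + bytesWrittenWithRewrite(events) + " bytes written");
        System.out.println("append:  " + bytesWrittenWithAppend(events) + " bytes written");
    }
}
```

In this model the rewrite variant grows quadratically with the number of events while the append variant grows linearly, which matches the shape of the slowdown described in the answer.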
The checkpoints for my Flink job are getting larger and larger. After drilling down into individual tasks, the keyed window function seems to be responsible for most of the size. How can I reduce this?
If you have a lot of state tied up in windows, there are several possibilities:
Using incremental aggregation (by using reduce or aggregate) can dramatically reduce your storage requirements. Otherwise each event is being copied into the list of events assigned to each window.
If you are aggregating over multiple timeframes, e.g., every minute and every 10 minutes, you can cascade these windows, so that the 10 minute windows are only consuming the output of the minute-long windows, rather than every event.
If you are using sliding windows, each event is being assigned to each of the overlapping windows. For example, if your windows are 2 minutes long and sliding by 1 second, each event is being copied into 120 windows. Incremental and/or pre-aggregation will help here (a lot!), or you may want to use a KeyedProcessFunction instead of a window in order to optimize your state footprint.
If you have keyed count windows, you could have keys for which the requisite batch size is never (or only very slowly) reached, leading to more and more partial batches sitting around in state. You could implement a custom Trigger that incorporates a timeout in addition to the count-based triggering so that these partial batches are eventually processed.
If you are using globalState in a ProcessWindowFunction, the globalState for stale keys will accumulate. You can use state TTL on the state descriptor for the globalState. Note: this is the only place where window state isn't automatically freed when windows are cleared.
Or it may simply be that your key space is growing over time, and there's really nothing that can be done except to scale up the cluster.
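The sliding-window arithmetic mentioned above (2-minute windows sliding by 1 second, so 120 copies per event) can be checked with a short plain-Java sketch. It assumes windows aligned to the epoch with no offset, each covering [start, start + size); the class and method names are made up for illustration and are not Flink's.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of sliding-window assignment; names are hypothetical.
class SlidingWindowCopiesSketch {

    // Returns the start timestamps of every window [start, start + size)
    // that contains the given timestamp, assuming epoch-aligned windows.
    static List<Long> windowStartsFor(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        // Latest window start that is not after the timestamp.
        long lastStart = timestampMs - Math.floorMod(timestampMs, slideMs);
        // Walk backwards while the window still covers the timestamp.
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        long timestamp = 1_000_500L;
        int sliding = windowStartsFor(timestamp, 120_000L, 1_000L).size();
        int tumbling = windowStartsFor(timestamp, 60_000L, 60_000L).size();
        System.out.println("2 min windows sliding by 1 s: " + sliding + " windows per event");
        System.out.println("1 min tumbling windows: " + tumbling + " window per event");
    }
}
```

Without incremental aggregation, each of those 120 windows would hold its own copy of the event; with it, each window holds only its own accumulator.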
I would like to do a window aggregation with early-trigger logic (the aggregation is triggered either when the window closes or by a specific event), and I read in the docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The docs note that "using ProcessWindowFunction for simple aggregates such as count is quite inefficient", so the suggestion is to pair it with incremental window aggregation.
My question is about the AverageAggregate in the docs: its state is not saved anywhere, so if the application crashes, the AverageAggregate will lose all its intermediate values, right?
If that is the case, is there a way to do a window aggregation that still supports incremental aggregation and has a state backend to recover from a crash?
The AggregateFunction is indeed only describing the mechanism for combining the input events into some result, that specific class does not store any data.
The state is persisted for us by Flink behind the scenes, though, when we write something like this:
input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) part tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
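The separation between the stateless aggregation logic and the state held by the runtime can be sketched in plain Java. Everything here is a hypothetical stand-in: the Aggregate interface only mirrors the general shape of an incremental aggregate, and WindowOperator is a toy that keeps one accumulator per key and tumbling-window start. It is not how Flink implements its window operator.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model; the interface and classes below are hypothetical, not Flink's.
class AccumulatorStoreSketch {

    // The aggregation logic: pure functions, no fields, no stored data.
    interface Aggregate<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
    }

    // Average whose accumulator is {sum, count}.
    static final class Average implements Aggregate<Long, long[], Double> {
        @Override public long[] createAccumulator() { return new long[] {0L, 0L}; }
        @Override public long[] add(Long value, long[] acc) {
            acc[0] += value;
            acc[1] += 1;
            return acc;
        }
        @Override public Double getResult(long[] acc) { return (double) acc[0] / acc[1]; }
    }

    // Stand-in for the runtime: it owns one accumulator per (key, window start).
    // This map is the part a checkpoint would have to persist.
    static final class WindowOperator<IN, ACC, OUT> {
        private final Aggregate<IN, ACC, OUT> function;
        private final long windowSizeMs;
        private final Map<String, ACC> state = new HashMap<>();

        WindowOperator(Aggregate<IN, ACC, OUT> function, long windowSizeMs) {
            this.function = function;
            this.windowSizeMs = windowSizeMs;
        }

        void process(String key, long timestampMs, IN value) {
            long windowStart = timestampMs - Math.floorMod(timestampMs, windowSizeMs);
            String slot = key + "@" + windowStart;
            ACC accumulator = state.get(slot);
            if (accumulator == null) {
                accumulator = function.createAccumulator();
            }
            state.put(slot, function.add(value, accumulator));
        }

        OUT result(String key, long windowStart) {
            return function.getResult(state.get(key + "@" + windowStart));
        }

        int stateSlots() {
            return state.size();
        }
    }

    public static void main(String[] args) {
        WindowOperator<Long, long[], Double> operator = new WindowOperator<>(new Average(), 60_000L);
        operator.process("a", 1_000L, 10L);
        operator.process("a", 2_000L, 20L);
        operator.process("a", 61_000L, 100L);
        System.out.println("slots held by the operator: " + operator.stateSlots());
        System.out.println("average for key a, first window: " + operator.result("a", 0L));
    }
}
```

The Average class never stores anything between calls; the accumulators live in the operator's map, which is the analogue of the per-key, per-window state that Flink manages and checkpoints.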
I want to know how Flink checkpoints the window operator. How does it ensure exactly-once semantics when recovering? For example, does it save the tuples in the current window and the progress of the current window's processing? I would like to understand the detailed process of the window operator's checkpoint and recovery.
All of Flink's stateful operators participate in the same checkpointing mechanism. When instructed to do so by the checkpoint coordinator (part of the job manager), the task managers initiate a checkpoint in each parallel instance of every source operator. The sources checkpoint their offsets and insert a checkpoint barrier into the stream. This divides the stream into the parts before and after the checkpoint. The barriers flow through the graph, and each stateful operator checkpoints its state upon having processed the stream up to the checkpoint barrier. The details are described at the link shared by #bupt_ljy.
Thus these checkpoints capture the entire state of the distributed pipeline, recording offsets into the input queues as well as the state throughout the job graph that has resulted from having ingested the data up to that point. When a failure occurs, the sources are rewound, the state is restored, and processing is resumed.
Given that during recovery the sources are rewound and replayed, "exactly once" means that the state managed by Flink is affected exactly once, not that the stream elements are processed exactly once.
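The distinction between "state is affected exactly once" and "elements are processed exactly once" can be illustrated with a small plain-Java simulation. It is a deliberately simplified single-threaded model with hypothetical names: one source offset, one running sum as the operator state, one checkpoint position, and one injected failure. It does not model barriers flowing through a distributed graph.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of checkpoint, failure, and replay; names are hypothetical.
class CheckpointReplaySketch {

    // A consistent snapshot: the source offset and the operator state
    // as they were at the same point in the stream.
    static final class Checkpoint {
        final int sourceOffset;
        final long sum;

        Checkpoint(int sourceOffset, long sum) {
            this.sourceOffset = sourceOffset;
            this.sum = sum;
        }
    }

    // Runs the stream with one checkpoint and one failure.
    // Returns {final sum, number of times an element was processed}.
    static long[] runWithOneFailure(List<Integer> input, int checkpointAt, int failAt) {
        int offset = 0;
        long sum = 0;
        long processed = 0;
        Checkpoint lastCheckpoint = new Checkpoint(0, 0L);
        boolean failed = false;
        while (offset < input.size()) {
            if (offset == checkpointAt) {
                // Snapshot offset and state together.
                lastCheckpoint = new Checkpoint(offset, sum);
            }
            if (!failed && offset == failAt) {
                // Failure: rewind the source and restore the state.
                failed = true;
                offset = lastCheckpoint.sourceOffset;
                sum = lastCheckpoint.sum;
                continue;
            }
            sum += input.get(offset);
            processed++;
            offset++;
        }
        return new long[] {sum, processed};
    }

    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6);
        long[] result = runWithOneFailure(input, 2, 4);
        System.out.println("final sum = " + result[0]);
        System.out.println("element processings = " + result[1]);
    }
}
```

With six inputs, a checkpoint taken before the third element, and a failure before the fifth, two elements are processed twice, yet the final sum is the same as in a failure-free run because the replayed elements are applied to the restored state.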
There's nothing particularly special about windows in this regard. Depending on the type of window function being applied, a window's contents are kept in an element of managed ListState, ReducingState, AggregatingState, or FoldingState. As stream elements arrive and are assigned to a window, they are appended, reduced, aggregated, or folded into that state. Other components of the window API, including Triggers and ProcessWindowFunctions, can have state that is checkpointed as well. For example, CountTrigger uses ReducingState to keep track of how many elements have been assigned to the window, adding one to the count as each element is added to the window.
In the case where the window function is a ProcessWindowFunction, all of the elements assigned to the window are saved in Flink state, and are passed in an Iterable to the ProcessWindowFunction when the window is triggered. That function iterates over the contents and produces a result. The internal state of the ProcessWindowFunction is not checkpointed; if the job fails during the execution of the ProcessWindowFunction the job will resume from the most recently completed checkpoint. This will involve rewinding back to a time before the window received the event that triggered the window firing (that event can't be included in the checkpoint because a checkpoint barrier following it can not have had its effect yet). Sooner or later the window will again reach the point where it is triggered and the ProcessWindowFunction will be called again -- with the same window contents it received the first time -- and hopefully this time it won't fail. (Note that I've ignored the case of processing-time windows, which do not behave deterministically.)
When a ProcessWindowFunction uses managed/checkpointed state, it is used to remember things between firings, not within a single firing. For example, a window that allows late events might want to store the result previously reported, and then issue an update for each late event.
I want to know how to checkpoint a window. For example, windowed wordcount:
DataStream<Tuple3<String, Long, Long>> counts =
    // split up the lines in pairs (2-tuples) containing: (word,1)
    text
        .flatMap(new Tokenizer())
        .assignTimestampsAndWatermarks(new timestamp())
        .keyBy(0)
        .timeWindow(Time.seconds(2))
        .process(new CountFunction());
Q1: What state should I save in CountFunction()? Do I need to save the buffer element of the window? Should I use ListState to store the buffered data in the window and use ValueState to store the current sum value?
Q2: When the fault occurs, how are the elements in the window handled? What happens when the window is restored?
Thank you for the help.
All of the state needed for Flink's windowing APIs is managed by Flink -- so you don't need to do anything. So long as checkpointing is enabled, the window buffer will be checkpointed and restored as needed.
Normally the CountFunction won't have any state that needs to be checkpointed. If the job fails while CountFunction is in the middle of iterating over the window's contents, the job will be rewound, and CountFunction will be called again with the same inputs.
If you do need to keep state in your CountFunction, then see "Using per-window state in ProcessWindowFunction" in the Flink documentation for information on how to go about that. It sounds like you will want to use globalState() (state that endures across all time), which you can access via the Context object passed to your process window function.
While you don't have a keyed stream, I suggest you use the keyed state mechanism described above. You can transform your non-keyed stream into a keyed stream by using keyBy with a constant key.