Let's assume I have a job with max.parallelism=4 and a RichFlatMapFunction that works with MapState. What is the best way to create the MapStateDescriptor? Inside the RichFlatMapFunction, which means each instance of the class gets its own descriptor, or as a single shared instance, e.g. a public static MapStateDescriptor descriptor in a separate class that the RichFlatMapFunction references? That way I would have just one MapStateDescriptor instead of 4, or did I misunderstand something?
Kind regards!
A few points...
Since each of your RichFlatMapFunction sub-tasks can be running in a different JVM on a different server, how would they share a static MapStateDescriptor?
Note that Flink's "max parallelism" isn't the same as the default environment parallelism. In general you want to leave the max parallelism value alone, and (if necessary) set your environment parallelism equal to the number of slots in your cluster.
The MapStateDescriptor doesn't store state; it tells Flink how to create the state. Your RichFlatMapFunction's open() method is where you'll create the state, using the state descriptor.
So net-net: don't bother with a static MapStateDescriptor; it won't help. Just create your state (as in many examples) in your open() method.
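For reference, a minimal sketch of that pattern (the word-counting logic is just a placeholder):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Counts occurrences per word; the descriptor is built in open(), once per sub-task.
public class CountingFlatMap extends RichFlatMapFunction<String, Long> {

    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) throws Exception {
        // The descriptor is cheap to create; each sub-task builds its own
        // and uses it to obtain a handle to its keyed state.
        MapStateDescriptor<String, Long> descriptor =
                new MapStateDescriptor<>("counts", String.class, Long.class);
        counts = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(String word, Collector<Long> out) throws Exception {
        Long current = counts.get(word);
        long updated = (current == null) ? 1L : current + 1;
        counts.put(word, updated);
        out.collect(updated);
    }
}

Note that MapState is keyed state, so this operator has to run on a keyed stream (i.e., after a keyBy).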
We have an application that consumes events from a Kafka source. The logic for processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction for keeping the current state information and a trigger that always fires in the onElement call. I am guessing that the alternative of using a KeyedProcessFunction that holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct, and are there any downsides to either one of these approaches?
I prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.
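For example, here's a rough sketch of that approach, with a simple running sum standing in for your aggregation logic:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits an updated result for every element -- the moral equivalent of the
// GlobalWindow + AggregateFunction + always-firing Trigger combination.
public class RunningSum
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(
                new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void processElement(
            Tuple2<String, Long> event,
            Context ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sum.value();
        long updated = (current == null ? 0L : current) + event.f1;
        sum.update(updated);
        out.collect(Tuple2.of(event.f0, updated)); // "fires" on every element
    }
}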
We are aggregating some data for 1 minute, which we then flush to a file. The data itself is like a map where the key is an object and the value is also an object.
Since we need to flush all of the data together, we are not doing any keyBy, and are therefore using windowAll.
The problem we are facing is that we get better throughput if we use the window with a ProcessAllWindowFunction and aggregate in the process call than if we use aggregate() with the window. We are also seeing timeouts in state checkpointing when we use aggregate().
I tried to go through the code base, and the only hypothesis I could come up with is that it is probably easier to checkpoint the ListState that process() uses than the aggregating state that aggregate() uses.
Is the hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance of aggregate()?
Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend, and are aggregating each incoming event into some sort of collection. In that case, the RocksDB state backend has to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.
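To make the ProcessAllWindowFunction option concrete, here's a sketch of how it might look (Tuple2 events and a per-key sum stand in for your map of objects):

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class WindowedAggregation {

    public static DataStream<Map<String, Long>> aggregate(DataStream<Tuple2<String, Long>> events) {
        return events
            .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .process(new ProcessAllWindowFunction<Tuple2<String, Long>, Map<String, Long>, TimeWindow>() {
                @Override
                public void process(Context context,
                                    Iterable<Tuple2<String, Long>> elements,
                                    Collector<Map<String, Long>> out) {
                    // The window's contents were buffered in ListState, so each
                    // incoming event was a cheap serialized append; the map is
                    // built only once, when the window fires.
                    Map<String, Long> result = new HashMap<>();
                    for (Tuple2<String, Long> e : elements) {
                        result.merge(e.f0, e.f1, Long::sum);
                    }
                    out.collect(result);
                }
            });
    }
}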
I am implementing an AggregateFunction to measure the duration between two events after .window(EventTimeSessionWindows.withGap(gap)). After the second event is processed, the window is closed.
Will Flink automatically checkpoint the state of the AggregateFunction, so that existing data in the accumulator is not lost on restart?
Since I am not sure about that, I tried to implement AggregatingState in a RichAggregateFunction:
class MyAgg extends RichAggregateFunction<IN, ACC, OUT>
AggregatingState requires an AggregatingStateDescriptor. Its constructor has this signature:

public AggregatingStateDescriptor(
        String name,
        AggregateFunction<IN, ACC, OUT> aggFunction,
        Class<ACC> stateType)
I am very confused by the aggFunction. What should be put here? Isn't it the MyAgg that I am trying to define in the first place?
An AggregateFunction doesn't have any state. But the aggregating state used in a streaming window (and manipulated by an AggregateFunction) is checkpointed as part of the window's state.
A RichAggregateFunction cannot be used in a window context, and an AggregateFunction cannot have its own state. It's designed this way because if an AggregateFunction were allowed to use a state descriptor to define ValueState, for example, then that state wouldn't be mergeable -- and to keep the Window API reasonably clean, all window state needs to be mergeable (for the sake of session windows).
AggregatingState is something you might use in a KeyedProcessFunction, for example. In that context, you need to define how elements are to be aggregated into the accumulator (i.e., the AggregatingState), which you do with an AggregateFunction.
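For instance, here's a sketch of how that looks in a KeyedProcessFunction, computing a running average per key (the types and logic are illustrative):

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.state.AggregatingState;
import org.apache.flink.api.common.state.AggregatingStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Keeps a running average per key; the AggregateFunction passed to the
// descriptor defines how new values fold into the accumulator.
public class RunningAverage extends KeyedProcessFunction<String, Tuple2<String, Long>, Double> {

    private transient AggregatingState<Long, Double> average;

    @Override
    public void open(Configuration parameters) {
        average = getRuntimeContext().getAggregatingState(
            new AggregatingStateDescriptor<>(
                "average",
                new AggregateFunction<Long, Tuple2<Long, Long>, Double>() {
                    @Override
                    public Tuple2<Long, Long> createAccumulator() {
                        return Tuple2.of(0L, 0L);
                    }

                    @Override
                    public Tuple2<Long, Long> add(Long value, Tuple2<Long, Long> acc) {
                        return Tuple2.of(acc.f0 + value, acc.f1 + 1);
                    }

                    @Override
                    public Double getResult(Tuple2<Long, Long> acc) {
                        return acc.f1 == 0 ? 0.0 : ((double) acc.f0) / acc.f1;
                    }

                    @Override
                    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
                        return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
                    }
                },
                TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {})));
    }

    @Override
    public void processElement(Tuple2<String, Long> event, Context ctx, Collector<Double> out)
            throws Exception {
        average.add(event.f1);       // folded into the accumulator via add()
        out.collect(average.get());  // current result via getResult()
    }
}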
I'm trying to implement a messaging scenario using Apache Flink Stateful Functions.
One of my states can be updated by two different functions provided to a MatchBinder. These two functions basically check the current state and update it accordingly.
What happens if these two functions are called concurrently for the same key?
Is there a queue mechanism for stateful functions called for the same key?
Can we lock the state access/update for sequential access?
What happens if these two functions are called concurrently for the same key?
The MatchBinder is basically a convenient way to write a single StateFun function that starts its execution by first matching the type (or properties) of the incoming message. It is a way to avoid writing code like this:
...
if (message instanceof A) {
    handleA((A) message);
} else if (message instanceof B) {
    handleB((B) message);
}
...
So in reality, although you are providing "different" Java functions to each bind case, it is the same StateFun function being invoked, and the correct bind case is selected.
Is there a queue mechanism for stateful functions called for the same key?
Yes, StateFun functions are invoked sequentially per address. While a function is applied for a specific address, no other message for that address is applied concurrently. This comes almost for free, thanks to having Apache Flink as the actual runtime.
Can we lock the state access/update for sequential access ?
State access and modifications are atomic and sequential per address.
Let's say I need to implement a custom sink using RichSinkFunction, and I need some variables like a DBConnection in the sink. Where should I initialize the DBConnection? I see that most articles initialize the DBConnection in the open() method; why not in the constructor?
A follow-up question: what kinds of variables should be initialized in the constructor, and which in open()?
The constructor of a RichFunction is only invoked on the client side. If something needs to actually be performed on the cluster, it should be done in open().
open() also needs to be used if you want access to the parameters of your Flink job or to the RuntimeContext (for state, counters, etc.). When you use open(), you also want to use close() in a symmetric fashion.
So to answer your question: your DBConnection should be initialized in open() only. In the constructor, you usually just store job-constant parameters in fields, such as how to access the key of your records if your sink can be reused across multiple projects with different data structures.
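Putting that together, here's a minimal sketch of the split (the JDBC URL and SQL are made up for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class JdbcSink extends RichSinkFunction<String> {

    // Job-constant parameter: set in the constructor, serialized with the function.
    private final String jdbcUrl;

    // Runtime resource: not serializable, so created in open() on the task manager.
    private transient Connection connection;

    public JdbcSink(String jdbcUrl) {
        this.jdbcUrl = jdbcUrl;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        connection = DriverManager.getConnection(jdbcUrl);
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        try (PreparedStatement stmt =
                connection.prepareStatement("INSERT INTO events (payload) VALUES (?)")) {
            stmt.setString(1, value);
            stmt.executeUpdate();
        }
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}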