Flink AggregateFunction vs KeyedProcessFunction with ValueState - apache-flink

We have an application that consumes events from a kafka source. The logic from processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction for keeping the current state information and a trigger that would always fire in onElement call. I am guessing that the alternative of using a KeyedProcessFunction that and holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct and are there any downsides to either one of these approaces?

In prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.

Related

How to process an already available state based on an event comes in a different stream in flink

We are working on deriving the status of accounts based on the activity on it. We calculate and keep the expiryOn date(which says the tentative, future date on which account expires) based on the user activity on the account.
We have a manual date change event which gives a date based on which the status of the account is emitted as Expired.
I would like to know on what would be the best way to achieve this.
So, my question is since the date change event occurs in future when compared to the calculation of the expiryOn date, can the broadcasted state be a solution for this? If yes, please suggest the way.
Or, is there any better approaches like Table API to solve this problem?
Broadcast state is suitable in cases (like this one) where you need to either share information or invoke actions that aren't keyed, and so cannot be sent to one relevant instance.
If you need to store the broadcast state, keep in mind that each instance will store a copy of the broadcast state on the heap, and include that copy in its checkpoints.
If you are using context.applytokeyedstate, be careful to make changes to the keyed state that are deterministic -- otherwise, in the event of a failure and recovery at a point where some instances of the broadcast operator have applied the changes to keyed state, and other instances have not, you could end up with inconsistencies.

Flink windowAll aggregate than window process?

We are aggregating some data for 1 minute which we then flush onto a file. The data itself is like a map where key is an object and value is also an object.
Since we need to flush the data together hence we are not doing any keyBy and hence are using windowAll.
The problem that we are facing is that we get better throughput if we use window function with ProcessAllWindowFunction and then aggregate in the process call vs when we use aggregate with window function. We are also seeing timeouts in state checkpointing when we use aggregate.
I tried to go through the code base and the only hypothesis I could think of is probably it is easier to checkpoint ListState that process will use vs the AggregateState that aggregate will use.
Is the hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance on aggregate?
Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend, and are aggregating each incoming event into into some sort of collection. In that case, the RocksDB state backend is having to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.

Flink window aggregation with state

I would like to do a window aggregation with an early trigger logic (you can think that the aggregation is triggered either by window is closed, or by a specific event), and I read on the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The doc mentioned that Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient. so the suggestion is to pair with incremental window aggregation.
My question is that AverageAggregate in the doc, the state is not saved anywhere, so if the application crashed, the averageAggregate will loose all the intermediate value, right?
So If that is the case, is there a way to do a window aggregation, still supports incremental aggregation, and has a state backend to recover from crash?
The AggregateFunction is indeed only describing the mechanism for combining the input events into some result, that specific class does not store any data.
The state is persisted for us by Flink behind the scene though, when we write something like this:
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AverageAggregate(), new MyProcessWindowFunction());
the .keyBy(<key selector>).window(<window assigner>) is indicating to Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of crash or restart, no data is lost (assuming state backend are configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.

Flink checkpointing state for non-keyed stream

I am new to Flink. I am trying to enable checkpointing and stateful in my application. I saw how we store keyed state from the Flink documents. But I am wondering can we store non-keyed state (state for ProcessFunction)
It's somewhat unusual to need non-keyed state, but there is documentation with examples.
In user code this is generally only needed for implementing custom sources and sinks, which is why the examples focus on those use cases. But in a ProcessFunction you would do the same, i.e., implement the CheckpointedFunction interface (i.e., the initializeState and snapshotState methods).
The only types of non-keyed state are ListState, UnionState, and BroadcastState, and ListState is probably the type you want to use. UnionState is very similar to ListState, it just uses a different strategy for redistributing state during rescaling (each parallel instance gets the entire list, instead of being assigned a slice of the list, and the instances are responsible for knowing what to do). BroadcastState is what's used by a BroadcastProcessFunction or KeyedBroadcastProcessFunction.

WPF and Active Objects

I have a collection of "active objects". That is, objects that need to preiodically update themselves. In turn, these objects should be used to update a WPF-based GUI.
In the past I would just have each object include it's own thread, but that only makes sense when working with a finite number of objects with well-defined life-cycles. Now I'm using objects that only exist when needed by a form so the life cycle is unpredicable. Also, I can have dozens of objects all making database and web service calls.
Under normal circumstances the update interval is 1 second, but it can take up to 30 seconds due to timeouts.
So, what design would you recommend?
You may use one dispatcher (scheduler) for all or group of active objects. Dispatcher can process high priority tasks at the first place then other ones.
You can see this article about the long-running active objects with code to find out how to do it. In additional I recommend to look at Half Sync/ Half Async pattern.
If you have questions - welcome.
I am not an expert, but I would just have the objects fire an event indicating when they've changed. The GUI can then refresh the necessary parts of itself (easy when using data binding and INotifyPropertyChanged) whenever it receives an event.
I'd probably try to generalize out some sort of data bus, if possible, and when objects are 'active' have them add themselves to a list of objects to be updated. I'd especially be tempted to use this pattern if the objects are backed by a database, as that way you can aggregate multiple queries, instead of having to do a single query per each object.
If there end up being no listeners for a specific object, no big deal, the data just goes nowhere.
The core updater code can then use a single timer (or multiple, or whatever is appropriate) to determine when to get updates. Doing this as more of a dataflow, and less of a 'state update' will probably save a lot of sanity in the end.

Resources