Flink windowAll: aggregate vs. window process? - apache-flink

We are aggregating some data for 1 minute, which we then flush onto a file. The data itself is like a map where both the key and the value are objects.
Since we need to flush the data together, we are not doing any keyBy and are therefore using windowAll.
The problem we are facing is that we get better throughput when we use a window with a ProcessAllWindowFunction and aggregate in the process call than when we use aggregate with the window. We are also seeing timeouts in state checkpointing when we use aggregate.
I tried to go through the code base, and the only hypothesis I could come up with is that it is probably easier to checkpoint the ListState that process uses than the AggregatingState that aggregate uses.
Is this hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance of aggregate?

Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend and are aggregating each incoming event into some sort of collection. In that case, the RocksDB state backend has to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.
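For illustration, the fast path might look something like this -- a minimal sketch, assuming hypothetical Event, Key, and Value types (and a Value::combine merge function) standing in for your map-like data:

stream
    .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .process(new ProcessAllWindowFunction<Event, Map<Key, Value>, TimeWindow>() {
        @Override
        public void process(Context context, Iterable<Event> events,
                            Collector<Map<Key, Value>> out) {
            // the window contents come out of ListState, so each arriving event
            // was stored as a cheap append of its serialized bytes
            Map<Key, Value> result = new HashMap<>();
            for (Event event : events) {
                result.merge(event.getKey(), event.getValue(), Value::combine);
            }
            out.collect(result); // one combined map per minute, ready to flush
        }
    });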

Related

Flink AggregateFunction vs KeyedProcessFunction with ValueState

We have an application that consumes events from a Kafka source. The logic for processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow with an AggregateFunction to keep the current state information and a Trigger that would always fire in the onElement call. I am guessing that the alternative of using a KeyedProcessFunction that holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct, and are there any downsides to either of these approaches?
I prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.
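As a rough illustration (not code from the question), the KeyedProcessFunction version might be shaped like this, with hypothetical Event, Result, and Accumulator types standing in for whatever the AggregateFunction previously maintained:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RunningAggregate extends KeyedProcessFunction<String, Event, Result> {

    private transient ValueState<Accumulator> state;

    @Override
    public void open(Configuration parameters) {
        state = getRuntimeContext().getState(
                new ValueStateDescriptor<>("acc", Accumulator.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out)
            throws Exception {
        Accumulator acc = state.value();   // null for the first event of this key
        if (acc == null) {
            acc = new Accumulator();
        }
        acc.add(event);                    // what the AggregateFunction's add() did
        state.update(acc);
        out.collect(acc.toResult());       // what the always-firing Trigger achieved
    }
}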

Guava Cache as ValueState in Flink

I am trying to de-duplicate events in my Flink pipeline, using a Guava cache.
My requirement is that I want to de-duplicate over a 1 minute window, but at any given point I want to maintain no more than 10000 elements in the cache.
A little background on my experiments with Flink windowing:
Tumbling windows: I was able to implement this using tumbling windows + a custom trigger. But the problem is, if an element occurs in the 59th second and again in the 61st second, it is not recognized as a duplicate.
Sliding windows: I also tried sliding windows with a 10 second overlap + a custom trigger. But an element that came in the 55th second is part of 5 different windows and is written to the sink 5 times.
Please let me know if I should not be seeing the above behavior with windowing.
Back to Guava:
I have an Event which looks like this and an EventsWrapper for these events which looks like this. I will be getting a stream of EventsWrappers and should remove duplicate Events across different EventsWrappers.
For example, if I have 2 EventsWrappers like below:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
 EventsWrapper{id='ew2', org='org2', events=[Event{id='e1', name='event1'}, Event{id='e3', name='event3'}]}]
I should emit as output the following:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
 EventsWrapper{id='ew2', org='org2', events=[Event{id='e3', name='event3'}]}]
i.e. making sure that the e1 event is emitted only once, assuming these two events are within the time and size requirements of the cache.
I created a RichFlatMap function where I initialize a Guava cache and a value state like this, and set the Guava cache in the value state like this. My overall pipeline looks like this.
But each time I try to update the Guava cache inside the value state:
eventsState.value().put(eventId, true);
I get the following error:
java.lang.NullPointerException
at com.google.common.cache.LocalCache.hash(LocalCache.java:1696)
at com.google.common.cache.LocalCache.put(LocalCache.java:4180)
at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4888)
at events.piepline.DeduplicatingFlatmap.lambda$flatMap$0(DeduplicatingFlatmap.java:59)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:176)
On further digging, I found out that the error is because the keyEquivalence inside the Guava cache is null.
I checked by setting directly on the Guava cache (not through state, but directly on the cache) and that works fine.
I felt this could be because ValueState is not able to serialize the Guava cache. So I added a serializer like this and registered it like this:
env.registerTypeWithKryoSerializer((Class<Cache<String,Boolean>>)(Class<?>)Cache.class, CacheSerializer.class);
But this didn't help either.
I have the following questions:
Any idea what I might be doing wrong with the Guava cache in the above case?
Is what I am seeing with my tumbling and sliding windows implementation what is expected, or am I doing something wrong?
What will happen if I don't set the Guava cache in ValueState, and instead just use it as a plain object in the DeduplicatingFlatmap class, operating directly on the Guava cache instead of through the ValueState? My understanding is that the Guava cache would then not be part of the checkpoint, so when the pipeline fails and restarts, the cache would have lost all the values in it and would be empty on restart. Is this understanding correct?
Thanks a lot in advance for the help.
Taking your three questions in order:
1. See below.
2. These windows are behaving as expected.
3. Your understanding is correct.
Even if you do get it working, using a Guava cache as ValueState will perform very poorly, because RocksDB is going to deserialize the entire cache on every access, and re-serialize it on every update.
Moreover, it looks like you are trying to share a single cache instance across all of the orgs that happen to be multiplexed across a single flatmap instance. That's not going to work, because the RocksDB state backend will make a copy of the cache for each org (a side effect of the serialization involved).
Your requirements aren't entirely clear, but a deduplication query might help. I'm thinking, though, that MapState in combination with timers in a KeyedProcessFunction is more likely to be the building block you need. Here's an example that might help you get started (but you'll want to handle the timers differently).
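For a flavor of that building block (a sketch under assumptions, not the linked example): flatten the wrappers into individual events, key them by a hash bucket of the event id so that duplicates meet regardless of org, and let MapState plus a timer forget each id after a minute. The Event type and its getId() are hypothetical, and the 10000-element cap is not enforced here.

import java.util.Iterator;
import java.util.Map;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// usage: events.keyBy(e -> Math.floorMod(e.getId().hashCode(), 128))
//              .process(new Deduplicator())
public class Deduplicator extends KeyedProcessFunction<Integer, Event, Event> {

    private transient MapState<String, Long> seen;  // event id -> expiry timestamp

    @Override
    public void open(Configuration parameters) {
        seen = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seen", String.class, Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out)
            throws Exception {
        if (!seen.contains(event.getId())) {
            long expiry = ctx.timerService().currentProcessingTime() + 60_000L;
            seen.put(event.getId(), expiry);
            ctx.timerService().registerProcessingTimeTimer(expiry);
            out.collect(event);             // first sighting within the minute
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Event> out)
            throws Exception {
        // drop every id whose minute has elapsed (coarse cleanup; a real job
        // would batch timers rather than register one per id)
        Iterator<Map.Entry<String, Long>> it = seen.iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= timestamp) {
                it.remove();
            }
        }
    }
}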

Flink window aggregation with state

I would like to do a window aggregation with early-trigger logic (think of the aggregation as being triggered either when the window closes or by a specific event), and I read in the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The doc notes that "using ProcessWindowFunction for simple aggregates such as count is quite inefficient", so the suggestion is to pair it with incremental window aggregation.
My question is: for the AverageAggregate in the doc, the state is not saved anywhere, so if the application crashes, the AverageAggregate will lose all of its intermediate values, right?
If that is the case, is there a way to do a window aggregation that still supports incremental aggregation and has a state backend to recover from a crash?
The AggregateFunction indeed only describes the mechanism for combining the input events into some result; that specific class does not store any data.
The state is persisted for us by Flink behind the scenes, though, when we write something like this:
input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
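For reference, the AverageAggregate on that doc page is roughly the following; its (sum, count) accumulator is exactly the piece of state Flink snapshots to the state backend for each key and window:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return new Tuple2<>(0L, 0L);                         // (sum, count)
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> acc) {
        return new Tuple2<>(acc.f0 + value.f1, acc.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> acc) {
        return ((double) acc.f0) / acc.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}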

Flink window state size and state management

After reading Flink's documentation and searching around, I couldn't entirely understand how Flink handles state in its windows.
Let's say I have an hourly tumbling window with an aggregation function that accumulates msgs into some Java POJO or Scala case class.
Will the size of that window be tied to the number of events entering it in a single hour, or just to the POJO/case class, since I'm accumulating the events into that object? (E.g., if counting 10000 msgs into an integer, will the size be close to 10000 * msg size, or the size of an int?)
Also, if I'm using POJOs or case classes, does Flink handle the state for me (spilling to disk if memory is exhausted, saving state at checkpoints, etc.), or must I use Flink's state objects for that?
Thanks for your help!
The state size of a window depends on the type of function that you apply. If you apply a ReduceFunction or AggregateFunction, arriving data is immediately aggregated and the window only holds the aggregated value. If you apply a ProcessWindowFunction or WindowFunction, Flink collects all input records and applies the function when time (event or processing time depending on the window type) passes the window's end time.
You can also combine both types of functions, i.e., have an AggregateFunction followed by a ProcessWindowFunction. In that case, arriving records are immediately aggregated and when the window is closed, the aggregation result is passed as single value to the ProcessWindowFunction. This is useful because you have incremental aggregation (due to ReduceFunction / AggregateFunction) but also access to the window metadata like begin and end timestamp (due to ProcessWindowFunction).
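A minimal sketch of the second half of that combination, assuming an upstream AggregateFunction that emits a per-key average as a Double:

import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class AverageWithWindowEnd
        extends ProcessWindowFunction<Double, Tuple3<String, Long, Double>, String, TimeWindow> {

    @Override
    public void process(String key, Context context, Iterable<Double> averages,
                        Collector<Tuple3<String, Long, Double>> out) {
        Double average = averages.iterator().next();  // exactly one element: the aggregate result
        // the Context exposes the window metadata mentioned above
        out.collect(new Tuple3<>(key, context.window().getEnd(), average));
    }
}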
How the state is managed depends on the chosen state backend. If you configure the FsStateBackend all local state is kept on the heap of the TaskManager and the JVM process is killed with an OutOfMemoryError if the state grows too large. If you configure the RocksDBStateBackend state is spilled to disk. This comes with de/serialization costs for every state access but gives much more storage for state.
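Choosing between the two is a one-line configuration change. A sketch using the pre-1.13 API, with a placeholder checkpoint URI (RocksDB additionally needs the flink-statebackend-rocksdb dependency):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// heap-based state, checkpointed to a (distributed) filesystem
env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

// or: state spilled to local disk via RocksDB (larger state, per-access
// serialization cost); note this constructor throws IOException
// env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));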

WPF and Active Objects

I have a collection of "active objects", that is, objects that need to periodically update themselves. In turn, these objects should be used to update a WPF-based GUI.
In the past I would just have each object include its own thread, but that only makes sense when working with a finite number of objects with well-defined life cycles. Now I'm using objects that only exist when needed by a form, so the life cycle is unpredictable. Also, I can have dozens of objects all making database and web service calls.
Under normal circumstances the update interval is 1 second, but it can take up to 30 seconds due to timeouts.
So, what design would you recommend?
You may use one dispatcher (scheduler) for all of the active objects, or one per group of them. The dispatcher can process high-priority tasks first, then the other ones.
You can see this article about long-running active objects, with code, to find out how to do it. In addition, I recommend looking at the Half-Sync/Half-Async pattern.
If you have questions, you're welcome to ask.
I am not an expert, but I would just have the objects fire an event indicating when they've changed. The GUI can then refresh the necessary parts of itself (easy when using data binding and INotifyPropertyChanged) whenever it receives an event.
I'd probably try to generalize out some sort of data bus, if possible, and when objects are 'active' have them add themselves to a list of objects to be updated. I'd especially be tempted to use this pattern if the objects are backed by a database, as that way you can aggregate multiple queries instead of doing one query per object.
If there end up being no listeners for a specific object, no big deal, the data just goes nowhere.
The core updater code can then use a single timer (or multiple, or whatever is appropriate) to determine when to get updates. Doing this as more of a dataflow, and less of a 'state update' will probably save a lot of sanity in the end.
