Flink checkpointing state for non-keyed stream - apache-flink

I am new to Flink. I am trying to enable checkpointing and use state in my application. The Flink documentation shows how to store keyed state, but can we also store non-keyed state (for example, state for a ProcessFunction)?

It's somewhat unusual to need non-keyed state, but there is documentation with examples.
In user code this is generally only needed for implementing custom sources and sinks, which is why the examples focus on those use cases. But in a ProcessFunction you would do the same: implement the CheckpointedFunction interface, which consists of the initializeState and snapshotState methods.
The only types of non-keyed state are ListState, UnionState, and BroadcastState, and ListState is probably the type you want to use. UnionState is very similar to ListState; it just uses a different strategy for redistributing state during rescaling. Each parallel instance gets the entire list, instead of being assigned a slice of the list, and the instances are responsible for knowing what to do with it. BroadcastState is what's used by a BroadcastProcessFunction or KeyedBroadcastProcessFunction.
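To make the difference between the two redistribution strategies concrete, here is a small plain-Java model. It is not Flink code and needs no Flink dependency. The class and method names are made up for illustration, and the round-robin split is only an assumed stand-in for however Flink actually slices the list.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of how non-keyed list state could be handed out after
// rescaling. NOT Flink code: names are hypothetical, and the round-robin
// split is an assumption standing in for Flink's even-split redistribution.
class RescaleSketch {

    // ListState style: pool all elements, then give each new instance a slice.
    static <T> List<List<T>> evenSplit(List<List<T>> oldStates, int newParallelism) {
        List<List<T>> result = new ArrayList<>();
        for (int n = 0; n < newParallelism; n++) {
            result.add(new ArrayList<>());
        }
        int i = 0;
        for (List<T> state : oldStates) {
            for (T element : state) {
                result.get(i++ % newParallelism).add(element);
            }
        }
        return result;
    }

    // UnionState style: every new instance receives the complete pooled list
    // and has to decide for itself which elements it cares about.
    static <T> List<List<T>> union(List<List<T>> oldStates, int newParallelism) {
        List<T> all = new ArrayList<>();
        for (List<T> state : oldStates) {
            all.addAll(state);
        }
        List<List<T>> result = new ArrayList<>();
        for (int n = 0; n < newParallelism; n++) {
            result.add(new ArrayList<>(all));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two old parallel instances, rescaled to three new ones.
        List<List<Integer>> old = List.of(List.of(1, 2), List.of(3, 4, 5));
        System.out.println(evenSplit(old, 3)); // [[1, 4], [2, 5], [3]]
        System.out.println(union(old, 3));     // [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
    }
}
```

With the split strategy each element ends up on exactly one instance, which is why it is the safer default. With the union strategy the restore logic has to filter the full list itself.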

Related

Flink AggregateFunction vs KeyedProcessFunction with ValueState

We have an application that consumes events from a Kafka source. The logic for processing each element needs to take into account the events that were previously received (with the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction for keeping the current state information and a trigger that always fires in the onElement call. I am guessing that the alternative of using a KeyedProcessFunction that holds the state in a ValueState object would be more appropriate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct, and are there any downsides to either of these approaches?
I prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object, rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.

Flink window aggregation with state

I would like to do a window aggregation with early-trigger logic (the aggregation is triggered either when the window closes or by a specific event), and I read in the docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The docs mention: "Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient." So the suggestion is to pair it with incremental window aggregation.
My question is about AverageAggregate in the docs: the state is not saved anywhere, so if the application crashes, AverageAggregate will lose all the intermediate values, right?
If that is the case, is there a way to do a window aggregation that still supports incremental aggregation and uses a state backend to recover from a crash?
The AggregateFunction is indeed only describing the mechanism for combining the input events into some result; that specific class does not store any data.
The state is persisted for us by Flink behind the scenes though, when we write something like this:
input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) part tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
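The "one piece of state per key and time bucket" idea can be modelled in a few lines of plain Java. This is only a sketch of the bookkeeping, not how Flink implements it: the class name, the string bucket key, and the tumbling-window arithmetic (which assumes non-negative timestamps) are all assumptions made for illustration. In a real job the map below is what lives in the state backend and gets checkpointed.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model (NOT Flink code) of the bookkeeping implied by
// keyBy + window + aggregate. All names here are hypothetical.
class WindowStateSketch {

    // Accumulator for an average: a running sum and a count.
    static final class Acc {
        long sum;
        long count;
    }

    private final long windowSizeMs;

    // One accumulator per (key, window start). In Flink, this is the part
    // that is held in the state backend and included in checkpoints.
    private final Map<String, Acc> state = new HashMap<>();

    WindowStateSketch(long windowSizeMs) {
        this.windowSizeMs = windowSizeMs;
    }

    // Incremental step: fold one event into its bucket. The raw events are
    // not retained, only the accumulator. Assumes timestampMs >= 0.
    void add(String key, long timestampMs, long value) {
        long windowStart = timestampMs - (timestampMs % windowSizeMs);
        Acc acc = state.computeIfAbsent(key + "@" + windowStart, k -> new Acc());
        acc.sum += value;
        acc.count++;
    }

    // Result for one bucket, or NaN if nothing was added to it.
    double average(String key, long windowStart) {
        Acc acc = state.get(key + "@" + windowStart);
        return acc == null ? Double.NaN : (double) acc.sum / acc.count;
    }

    int bucketCount() {
        return state.size();
    }

    public static void main(String[] args) {
        WindowStateSketch windows = new WindowStateSketch(5000);
        windows.add("a", 1000, 10);
        windows.add("a", 2500, 20);
        windows.add("a", 6000, 30);
        windows.add("b", 100, 7);
        System.out.println(windows.bucketCount());      // 3
        System.out.println(windows.average("a", 0));    // 15.0
        System.out.println(windows.average("a", 5000)); // 30.0
    }
}
```

The aggregate logic itself stays stateless. What survives a restart is the map of accumulators, which is the role the window operator's state plays in Flink.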

One object flink operator (ex. Filter) or two objects in Apache Flink Job

I have an Apache Flink job with 4 input DataStreams (JSON messages) from separate Apache Kafka topics, and only one XFilterFunction object, which does some filtering. I wrote some data pipeline logic (a simplified example):
FilterFunction<MyEvent> xFilter = new XFilterFunction();
inputDataStream1.filter(xFilter)
    .name("Xfilter")
    .uid("Xfilter");
inputDataStream2
    .union(inputDataStream3)
    // here some logic (map, process, ...)
    .filter(xFilter);
Is it good or bad practice to reuse one XFilterFunction object in the job?
Or is it better to create two XFilterFunction objects (2 streams -> 2 new filter objects)?
If you instantiate the class several times, i.e.
inputDataStream1.filter(new XFilterFunction());
...
inputDataStream2.filter(new XFilterFunction());
there should be no problem. I'm not sure whether sharing one instance could cause unwanted behaviour with things like state or overridden contextual functions.
If the class is not a specialization of RichFunction, it may even be just a pure function invocation via delegates. Unfortunately I don't know Flink's internals well enough to say, but with the solution above you should be safe.
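One detail that may help here: as far as I understand, Flink requires user functions to be Serializable and ships a serialized copy to each operator instance, so the object you create in main() is not the object that runs. The plain-Java sketch below shows only that copy-by-serialization effect. CountingFilter is a hypothetical stand-in for XFilterFunction, and copy() is ordinary JDK serialization, not a Flink API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java illustration (NOT Flink code) of why two operators built from
// one function object do not share mutable fields once each has received
// its own serialized copy. All names are hypothetical.
class SharedFunctionSketch {

    // Stand-in for a user filter function with a mutable field.
    static class CountingFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        int seen = 0;

        boolean filter(String value) {
            seen++;
            return !value.isEmpty();
        }
    }

    // Deep copy through JDK serialization, wrapping checked exceptions.
    static <T extends Serializable> T copy(T original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T result = (T) in.readObject();
                return result;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the counters of {copy for operator A, copy for operator B, original}.
    static int[] demo() {
        CountingFilter shared = new CountingFilter();
        CountingFilter operatorA = copy(shared);
        CountingFilter operatorB = copy(shared);
        operatorA.filter("x");
        operatorA.filter("y");
        operatorB.filter("z");
        return new int[] {operatorA.seen, operatorB.seen, shared.seen};
    }

    public static void main(String[] args) {
        int[] counters = demo();
        // Each copy counts independently; the original is never touched.
        System.out.println(counters[0] + " " + counters[1] + " " + counters[2]); // 2 1 0
    }
}
```

Creating a new instance per operator is still the clearer choice, since it does not rely on the reader knowing this.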

Does Flink automatically checkpoint AggregateFunction's state and how to use AggregatingStateDescriptor?

I am implementing an AggregateFunction to measure the duration between two events after .window(EventTimeSessionWindows.withGap(gap)). After the second event is processed, the window is closed.
Will Flink automatically checkpoint the state of the AggregateFunction so that existing data in the accumulator is not lost on restart?
Since I am not sure about that, I tried to implement AggregatingState in a RichAggregateFunction:
class MyAgg extends RichAggregateFunction<IN, ACC, OUT>
AggregatingState requires an AggregatingStateDescriptor. Its constructor has this signature:
public AggregatingStateDescriptor(
        String name,
        AggregateFunction<IN, ACC, OUT> aggFunction,
        Class<ACC> stateType) {
I am very confused by the aggFunction parameter. What should be put here? Isn't it the MyAgg that I am trying to define in the first place?
An AggregateFunction doesn't have any state. But the aggregating state used in a streaming window (and manipulated by an AggregateFunction) is checkpointed as part of the window's state.
A RichAggregateFunction cannot be used in a window context, and an AggregateFunction cannot have its own state. It's designed this way because if an AggregateFunction were allowed to use a state descriptor to define ValueState, for example, then that state wouldn't be mergeable. To keep the Window API reasonably clean, all window state needs to be mergeable (for the sake of session windows).
AggregatingState is something you might use in a KeyedProcessFunction, for example. In that context, you need to define how elements are to be aggregated into the accumulator (i.e., the AggregatingState), which you do with an AggregateFunction.
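The relationship between the two pieces can be modelled without Flink. In the plain-Java sketch below, the Agg interface copies the method names createAccumulator, add, and getResult from Flink's AggregateFunction (merge is left out for brevity), and KeyedAggregatingState is a hypothetical stand-in for what the state does per key. Neither is the real Flink type. The point is only that the aggFunction argument is a separate, stateless description of how to fold values, while the state object owns the accumulator.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model (NOT Flink code) of how aggregating state relates to
// the aggregate function passed into its descriptor. Names are hypothetical.
class AggregatingStateSketch {

    // Mirrors three method names of Flink's AggregateFunction; merge omitted.
    interface Agg<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
    }

    // Stateless fold: track {earliest, latest} timestamp, report the gap.
    static final class DurationAgg implements Agg<Long, long[], Long> {
        public long[] createAccumulator() {
            return new long[] {Long.MAX_VALUE, Long.MIN_VALUE};
        }

        public long[] add(Long timestamp, long[] acc) {
            acc[0] = Math.min(acc[0], timestamp);
            acc[1] = Math.max(acc[1], timestamp);
            return acc;
        }

        public Long getResult(long[] acc) {
            return acc[1] - acc[0];
        }
    }

    // Stand-in for per-key aggregating state: values of type IN go in,
    // a result of type OUT comes out, and the accumulator is kept here.
    static final class KeyedAggregatingState<K, IN, ACC, OUT> {
        private final Agg<IN, ACC, OUT> agg;
        private final Map<K, ACC> accumulators = new HashMap<>();

        KeyedAggregatingState(Agg<IN, ACC, OUT> agg) {
            this.agg = agg;
        }

        void add(K key, IN value) {
            ACC acc = accumulators.get(key);
            if (acc == null) {
                acc = agg.createAccumulator();
            }
            accumulators.put(key, agg.add(value, acc));
        }

        // Returns null in this model when nothing was added for the key.
        OUT get(K key) {
            ACC acc = accumulators.get(key);
            return acc == null ? null : agg.getResult(acc);
        }
    }

    public static void main(String[] args) {
        KeyedAggregatingState<String, Long, long[], Long> state =
                new KeyedAggregatingState<>(new DurationAgg());
        state.add("session-1", 1000L);
        state.add("session-1", 4500L);
        System.out.println(state.get("session-1")); // 3500
    }
}
```

So in the question's terms, the class passed as aggFunction would be a plain AggregateFunction like DurationAgg here, not the rich function that is trying to hold the state.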

store aggregation result of a time-frame into Flink's state backend

I'm new to Apache Flink (1 day :) ), and have seen in a few guides that it saves state.
According to the documentation, you can use:
MemoryStateBackend
FsStateBackend
RocksDBStateBackend
Nevertheless, I couldn't find sample code for reading from or writing to these state backends.
Does that mean they are for Flink's internal usage, or can I use them as well?
Meaning: can I store the last day's aggregations, restart Flink, and then read the cache again (like you would do with Redis, for example)?
Flink's state backends are used for storing the current state of your operators.
There are examples and a detailed explanation available here, if you haven't seen them already.
Essentially, the state is defined in the public void open(Configuration config) function, and then in the flatMap function you can read the state by calling mystate.value() and update it by calling mystate.update(newvalue).
Currently this is what you can do with state, but there is a new feature called QueryableState in progress (FLINK-3779), which will enable you to query Flink's state from outside Flink.
PS: I am not aware of how Redis handles state.
