Apache Flink States in ProcessWindowFunction

I am trying to understand the differences between the various kinds of state that can be used in a ProcessWindowFunction.
First, ProcessWindowFunction is an AbstractRichFunction:
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window]
extends AbstractRichFunction {...}
As such it can use the method
public RuntimeContext getRuntimeContext()
to get a state
getRuntimeContext().getState
Moreover, the process method of ProcessWindowFunction
def process(key: KEY, context: Context, elements: Iterable[IN], out: Collector[OUT]) {}
has a context, which in turn offers two methods for getting state:
/**
* State accessor for per-key and per-window state.
*/
def windowState: KeyedStateStore
/**
* State accessor for per-key global state.
*/
def globalState: KeyedStateStore
Here are my questions:
1) How are these related to getRuntimeContext().getState?
2) I often use a custom Trigger implementation and a GlobalWindow. In this case the state is retrieved with getPartitionedState. Can I also access, in the trigger function, a window state defined in the ProcessWindowFunction? If so, how?
3) There is no open method in the Trigger class to override, so how is state creation handled? Is it safe to just call getPartitionedState, and does that call also manage state creation?

1) getRuntimeContext().getState calls are equivalent to globalState of a ProcessWindowFunction.Context. Both are "global" state, as opposed to the "window" state of windowState. "Global" means that the state is shared across all of the windows that have the same key. windowState is separate for each window, even for the same key. Keep in mind that even "global" state is NOT shared across different keys.
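2) According to its Javadoc, TriggerContext#getPartitionedState() returns state scoped to the key and window of the current trigger invocation, so it seems to me that it corresponds to ProcessWindowFunction.Context#windowState() rather than globalState(). With a GlobalWindow there is only one window per key, so in that setup the distinction matters little.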
3) Based on the code and one example that I found (org.apache.flink.table.runtime.triggers.StateCleaningCountTrigger): yes, getPartitionedState() should handle creating the state if it wasn't created before.
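The difference between per-key "global" state and per-key, per-window state can be illustrated with a small plain-Java model. This is only a sketch and does not use Flink's API: the class StateScopeModel and its method names are hypothetical, and two HashMaps stand in for the state backend. One map is addressed by key alone (like globalState), the other by key and window (like windowState).

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java model of the two state scopes. NOT Flink's API; all names are hypothetical.
class StateScopeModel {
    // "global" scope: one counter per key, shared by all windows of that key
    private final Map<String, Integer> globalState = new HashMap<>();
    // "window" scope: one counter per (key, window) pair
    private final Map<String, Integer> windowState = new HashMap<>();

    // Simulates one invocation of process() for the given key and window.
    void onWindowFired(String key, String window) {
        globalState.merge(key, 1, Integer::sum);
        windowState.merge(key + "@" + window, 1, Integer::sum);
    }

    // Number of firings seen for this key across all of its windows.
    int global(String key) {
        return globalState.getOrDefault(key, 0);
    }

    // Number of firings seen for this key in this particular window.
    int window(String key, String window) {
        return windowState.getOrDefault(key + "@" + window, 0);
    }

    public static void main(String[] args) {
        StateScopeModel model = new StateScopeModel();
        model.onWindowFired("user-1", "w1");
        model.onWindowFired("user-1", "w2");
        model.onWindowFired("user-2", "w1");

        System.out.println(model.global("user-1"));       // 2: shared across w1 and w2
        System.out.println(model.window("user-1", "w1")); // 1: separate per window
        System.out.println(model.global("user-2"));       // 1: never shared across keys
    }
}
```

In this model neither map ever mixes entries of different keys, which mirrors the remark that even "global" state is not shared across keys.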

Related

How to delete state by name for all keys

I had the following state:
public static final ValueStateDescriptor<String> MY_STATE_DESCRIPTOR =
new ValueStateDescriptor<>("myState", String.class);
static {
MY_STATE_DESCRIPTOR.setQueryable("QueryableMyState");
}
protected transient ValueState<String> myState;
@Override
public void open(Configuration parameters) {
myState = getRuntimeContext().getState(MY_STATE_DESCRIPTOR);
}
This is in my KeyedCoProcessFunction implementation. I no longer need this state, but I cannot find a way to delete all entries from "myState" without knowing all of the keys it holds.

I assume you have other state in this application that you don't want to lose.
A few options:
(1) Use the State Processor API to modify a savepoint. Only carry over the state you want to keep. Or use the State Processor API to dump out a list of all of the keys for which there is state, and then use that knowledge to clear it. See ReadRidesAndFaresSnapshot.java for an example showing how to use this API with state snapshots taken from this application.
(2) Temporarily turn the KeyedCoProcessFunction into a KeyedBroadcastProcessFunction with the same UID, and use the applyToKeyedState method to loop over all the keys and clear the state. (This is a somewhat hacky solution which I'm including just for fun.)
(3) Throw away all of your state and start over.

Could state TTL achieve the same effect? A time-to-live (TTL) can be assigned to keyed state of any type. If a TTL is configured and a state value has expired, the stored value is cleaned up on a best-effort basis.

How to clear the whole MapState state with only one call

I know that mapState.clear() removes all the values in the state for the current key. My question is: is there a way to do something like mapState.clear() that clears the MapState of every key with just one call? Afterwards, mapState.isEmpty() should return true for every key, not just for the current one.
Thanks.
Kind regards!

Because we are talking about a situation with nested maps, it's easy to get our terminology confused. So let's put this question into the context of an example.
Suppose you have a stream of events about users, and inside a KeyedProcessFunction you are using a MapState<ATTR, VALUE> to maintain a map of attribute/value pairs for each user:
userEvents
.keyBy(e -> e.userId)
.process(new ManageUserData())
Inside the process function, any time you are working with MapState you can only manipulate the one map for the user corresponding to the event being processed,
public static class ManageUserData extends KeyedProcessFunction<...> {
MapState<ATTR, VALUE> userMap;
}
so userMap.clear() will clear the entire map of attribute/value pairs for one user, but leave the other maps alone.
I believe you are asking if there's some way to clear all of the MapStates for all users at once. And yes, there is a way to do this, though it's a bit obscure and not entirely straightforward to implement.
If you change the KeyedProcessFunction in this example to a KeyedBroadcastProcessFunction, and connect a broadcast stream to the stream of user events, then in that KeyedBroadcastProcessFunction you can use KeyedBroadcastProcessFunction.Context#applyToKeyedState inside the processBroadcastElement() method to iterate over all of the users and, for each user, clear their MapState.
You will have to arrange to send an event on the broadcast stream whenever you want this to happen.
You should pay attention to the warnings in the documentation regarding working with broadcast state. And keep in mind that the logic implemented in processBroadcastElement() must have the same deterministic behavior across all parallel instances.
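The distinction between clearing one user's map and clearing every user's map can be sketched in plain Java. This is a model under stated assumptions, not Flink code: KeyedMapStateModel, clearCurrentKey and applyToAllKeys are hypothetical names, and a nested HashMap stands in for the keyed state backend. applyToAllKeys plays the role that applyToKeyedState plays in the answer above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Plain-Java model of per-key map state. NOT Flink's API; all names are hypothetical.
class KeyedMapStateModel {
    // outer map: user id -> that user's attribute/value map (the "MapState" of that key)
    private final Map<String, Map<String, String>> perKey = new HashMap<>();

    void put(String userId, String attr, String value) {
        perKey.computeIfAbsent(userId, k -> new HashMap<>()).put(attr, value);
    }

    // Analogue of userMap.clear() while an event of this user is being processed.
    void clearCurrentKey(String userId) {
        Map<String, String> attrs = perKey.get(userId);
        if (attrs != null) {
            attrs.clear();
        }
    }

    // Analogue of visiting the state of every key and applying a function to it.
    void applyToAllKeys(BiConsumer<String, Map<String, String>> fn) {
        perKey.forEach(fn);
    }

    boolean isEmpty(String userId) {
        return perKey.getOrDefault(userId, Collections.emptyMap()).isEmpty();
    }

    public static void main(String[] args) {
        KeyedMapStateModel state = new KeyedMapStateModel();
        state.put("alice", "color", "red");
        state.put("bob", "color", "blue");

        state.clearCurrentKey("alice");
        System.out.println(state.isEmpty("alice")); // true
        System.out.println(state.isEmpty("bob"));   // false: other keys are untouched

        state.applyToAllKeys((user, attrs) -> attrs.clear());
        System.out.println(state.isEmpty("bob"));   // true: every key was visited
    }
}
```

In a real job the "visit every key" step is triggered by an element on the broadcast stream, which this model leaves out.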

Does Flink automatically checkpoint AggregateFunction's state and how to use AggregatingStateDescriptor?

I am implementing an AggregateFunction to measure the duration between two events after .window(EventTimeSessionWindows.withGap(gap)). After the second event is processed, the window is closed.
Will Flink automatically checkpoint the state of the AggregateFunction so that existing data in the accumulator is not lost on restart?
Since I am not sure about that, I tried to use AggregatingState in a RichAggregateFunction:
class MyAgg extends RichAggregateFunction<IN, ACC, OUT>
AggregatingState requires an AggregatingStateDescriptor. Its constructor has this signature:
public AggregatingStateDescriptor(
String name,
AggregateFunction<IN, ACC, OUT> aggFunction,
Class<ACC> stateType) {
I am very confused by aggFunction. What should go here? Isn't it the MyAgg that I am trying to define in the first place?

An AggregateFunction doesn't have any state. But the aggregating state used in a streaming window (and manipulated by an AggregateFunction) is checkpointed as part of the window's state.
A RichAggregateFunction cannot be used in a window context, and an AggregateFunction cannot have its own state. It's designed this way because if an AggregateFunction were allowed to use a state descriptor to define ValueState, for example, then that state wouldn't be mergeable -- and to keep the Window API reasonably clean, all window state needs to be mergeable (for the sake of session windows).
AggregatingState is something you might use in a KeyedProcessFunction, for example. In that context, you need to define how elements are to be aggregated into the accumulator (i.e., the AggregatingState), which you do with an AggregateFunction.
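To make the role of the aggregate function and the mergeability requirement concrete, here is a plain-Java sketch of the duration measurement from the question. The interface SimpleAggregate is a hypothetical stand-in that mirrors the four methods of Flink's AggregateFunction (createAccumulator, add, getResult, merge); it is declared locally so the example runs without Flink. The accumulator layout (earliest and latest timestamp) is an assumption, not something taken from the question.

```java
// Hypothetical local stand-in for the shape of Flink's AggregateFunction.
interface SimpleAggregate<IN, ACC, OUT> {
    ACC createAccumulator();
    ACC add(IN value, ACC accumulator);
    OUT getResult(ACC accumulator);
    ACC merge(ACC a, ACC b);
}

// Measures the time between the earliest and the latest event timestamp.
// The accumulator is {earliest, latest}; the function object itself holds no state.
class DurationAggregate implements SimpleAggregate<Long, long[], Long> {
    @Override
    public long[] createAccumulator() {
        return new long[] {Long.MAX_VALUE, Long.MIN_VALUE};
    }

    @Override
    public long[] add(Long timestamp, long[] acc) {
        acc[0] = Math.min(acc[0], timestamp);
        acc[1] = Math.max(acc[1], timestamp);
        return acc;
    }

    @Override
    public Long getResult(long[] acc) {
        // An accumulator that has seen no events yields a duration of 0.
        return acc[1] < acc[0] ? 0L : acc[1] - acc[0];
    }

    @Override
    public long[] merge(long[] a, long[] b) {
        // Two partial accumulators (for example from two merging session windows)
        // combine without loss, which is what makes this state mergeable.
        return new long[] {Math.min(a[0], b[0]), Math.max(a[1], b[1])};
    }

    public static void main(String[] args) {
        DurationAggregate agg = new DurationAggregate();
        long[] first = agg.add(1000L, agg.createAccumulator());
        long[] second = agg.add(4500L, agg.createAccumulator());
        System.out.println(agg.getResult(agg.merge(first, second))); // 3500
    }
}
```

All data lives in the accumulator that is passed in and returned, so whoever owns the accumulator (the window operator, or an AggregatingState) is also the one who can checkpoint it.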

Why do I get a NullPointerException when using initializeState() in Apache Flink?

I am using operator state with CheckpointedFunction, but I get a NullPointerException while initializing a MapState:
public void initializeState(FunctionInitializationContext context) throws Exception {
MapStateDescriptor<Long, Long> descriptor
= new MapStateDescriptor<>(
"state",
TypeInformation.of(new TypeHint<Long>() {}),
TypeInformation.of(new TypeHint<Long>() {})
);
state = context.getKeyedStateStore().getMapState(descriptor);
}
I get the NullPointerException when I pass "descriptor" to getMapState().
Here is the stacktrace:
java.lang.NullPointerException
at fyp.Buffer.initializeState(Iteration.java:51)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)

I guess you're running into an NPE because you're trying to access the KeyedStateStore, but since your function does not run on a keyed stream, no such state store is available in your job. Its documentation says:
Gets a handle to the system's key/value state. The key/value state is only accessible if the function is executed on a KeyedStream. On each access, the state exposes the value for the key of the element currently processed by the function. Each function may have multiple partitioned states, addressed with different names.
So if you implement CheckpointedFunction on a non-keyed stream (and want to keep it that way), you should access the operator state store instead:
snapshotMetadata = context.getOperatorStateStore.getUnionListState(descriptor)
Operator state gives you one state per parallel instance of your operator, in contrast to keyed state, where each state instance is scoped to a key produced by a keyed stream.
Note that the example above requests getUnionListState, which on restore hands every parallel instance the state of all parallel instances of the operator (as a list of states).
If you are looking for a concrete example, have a look at this source: it is an operator that uses operator state.
Finally, if you do need keyed state, consider reworking your solution so that the function runs on a keyed stream, where the keyed state store is available.

In GObject (GLib), does an instance of a subclass inherit the properties of its parent class?

First issue:
In GObject, does an instance of a subclass that derives from a parent class inherit the properties of the parent class or not?
Second issue:
In GObject, the g_object_class_install_properties function adds properties to the class itself in the class initializer function, but in effect each instance of the class has its own copy of these properties.
In addition, I read the following code snippet in gobject.c:
class->set_property = g_object_do_set_property;
class->get_property = g_object_do_get_property;
First, when are the above functions called?
Second, suppose a subclass overrides these methods (set_property and get_property), and g_object_new creates a new subclass instance and sets property values. Is only the subclass's set_property callback called, or is the parent class's set_property method also called afterwards?
In other words, after the subclass's set_property has been called once, is the parent class's set_property also called once?
If you know the answers, I would appreciate your taking the time to reply. Thank you very much in advance.

If you have not yet seen the GNOME Developer site, it has several pages of useful information relevant to the questions you ask. The links pointed to below contain very simple example code, followed by very detailed descriptions of what happens in the code. The example pages I have cited (and linked) below address your questions specifically, but much more content on the topic is available in surrounding pages.
First issue:
Derivable types can be subclassed further, and their class and
instance structures form part of the public API which must not be
changed if API stability is cared about. They are declared using
G_DECLARE_DERIVABLE_TYPE:
See examples here:
G_DECLARE_DERIVABLE_TYPE()
Second issue:
generic get/set mechanism for object properties. When an object is
instantiated, the object's class_init handler should be used to
register the object's properties with
g_object_class_install_properties.
See examples here: Object properties
I believe your specific question:
when are above functions called?
is addressed in great detail in these and the surrounding paragraphs of the Object properties link above:
If the user's GValue had been set to a valid value,
g_object_set_property would have proceeded with calling the object's
set_property class method. Here, since our implementation of Foo did
override this method, execution would jump to foo_set_property after
having retrieved from the GParamSpec the param_id [4] which had been
stored by g_object_class_install_property.
Once the property has been set by the object's set_property class
method, execution returns to g_object_set_property which makes sure
that the "notify" signal is emitted on the object's instance with the
changed property as parameter unless notifications were frozen by
g_object_freeze_notify.
