I have the following state:
public static final ValueStateDescriptor<String> MY_STATE_DESCRIPTOR =
new ValueStateDescriptor<>("myState", String.class);
static {
MY_STATE_DESCRIPTOR.setQueryable("QueryableMyState");
}
protected transient ValueState<String> myState;
@Override
public void open(Configuration parameters) {
myState = getRuntimeContext().getState(MY_STATE_DESCRIPTOR);
}
in my KeyedCoProcessFunction implementation. But I no longer need it, and I cannot find a way to delete all entries from "myState" when I don't know all the keys in that state.
I assume you have other state in this application that you don't want to lose.
A few options:
(1) Use the State Processor API to modify a savepoint. Only carry over the state you want to keep. Or use the State Processor API to dump out a list of all of the keys for which there is state, and then use that knowledge to clear it. See ReadRidesAndFaresSnapshot.java for an example showing how to use this API with state snapshots taken from this application.
(2) Temporarily turn the KeyedCoProcessFunction into a KeyedBroadcastProcessFunction with the same UID, and use the applyToKeyedState method to loop over all the keys and clear the state. (This is a somewhat hacky solution which I'm including just for fun.)
(3) Throw away all of your state and start over.
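For option (1), a sketch of using the State Processor API (the legacy batch-based variant) to dump the keys for which "myState" exists. The savepoint path, the operator UID "my-uid", and the String key type are assumptions for illustration; substitute your own:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.ExistingSavepoint;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateReaderFunction;
import org.apache.flink.util.Collector;

public class DumpMyStateKeys {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment bEnv = ExecutionEnvironment.getExecutionEnvironment();

        // Load an existing savepoint (path is a placeholder)
        ExistingSavepoint savepoint =
            Savepoint.load(bEnv, "file:///path/to/savepoint", new MemoryStateBackend());

        // Read the keyed state registered under the operator's UID
        DataSet<String> keys = savepoint.readKeyedState("my-uid", new ReaderFunction());
        keys.print();
    }

    // Emits one record per key that has state; assumes the key type is String
    static class ReaderFunction extends KeyedStateReaderFunction<String, String> {
        ValueState<String> state;

        @Override
        public void open(Configuration parameters) {
            // Must match the descriptor used in the original job
            state = getRuntimeContext().getState(
                new ValueStateDescriptor<>("myState", String.class));
        }

        @Override
        public void readKey(String key, Context ctx, Collector<String> out) throws Exception {
            out.collect(key);
        }
    }
}
```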
Can state TTL achieve the same effect? From the docs: "A time-to-live (TTL) can be assigned to the keyed state of any type. If a TTL is configured and a state value has expired, the stored value will be cleaned up on a best effort basis."
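State TTL can approximate this, though it expires entries per key after a configured time rather than clearing everything at once, and cleanup is best-effort. A minimal sketch of enabling it on the descriptor from the question (note that enabling TTL changes the serialized state format, so it generally cannot be switched on for already-existing state without compatibility problems):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.days(1))  // entries expire one day after the last write
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();

ValueStateDescriptor<String> descriptor =
    new ValueStateDescriptor<>("myState", String.class);
descriptor.enableTimeToLive(ttlConfig);
```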
Related
I know that if I call mapState.clear() I will clear all the values in the state for the current key, but my question is: is there a way to do something like mapState.clear() that clears all the entries in all the MapStates with just one call? Something such that mapState.isEmpty() would then return "true" for every key, because all the maps were cleaned up, not just the one for the current key.
Thanks.
Kind regards!
Because we are talking about a situation with nested maps, it's easy to get our terminology confused. So let's put this question into the context of an example.
Suppose you have a stream of events about users, and inside a KeyedProcessFunction you are using a MapState<ATTR, VALUE> to maintain a map of attribute/value pairs for each user:
userEvents
.keyBy(e -> e.userId)
.process(new ManageUserData())
Inside the process function, any time you are working with MapState you can only manipulate the one map for the user corresponding to the event being processed,
public static class ManageUserData extends KeyedProcessFunction<...> {
MapState<ATTR, VALUE> userMap;
}
so userMap.clear() will clear the entire map of attribute/value pairs for one user, but leave the other maps alone.
I believe you are asking if there's some way to clear all of the MapStates for all users at once. And yes, there is a way to do this, though it's a bit obscure and not entirely straightforward to implement.
If you change the KeyedProcessFunction in this example to a KeyedBroadcastProcessFunction, and connect a broadcast stream to the stream of user events, then in that KeyedBroadcastProcessFunction you can use Context#applyToKeyedState inside of the processBroadcastElement() method to iterate over all of the users and, for each user, clear their MapState.
You will have to arrange to send an event on the broadcast stream whenever you want this to happen.
You should pay attention to the warnings in the documentation regarding working with broadcast state. And keep in mind that the logic implemented in processBroadcastElement() must have the same deterministic behavior across all parallel instances.
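Sketched, the broadcast side could look like this. ClearCommand, Output, and USER_MAP_DESCRIPTOR are assumed names for illustration, not part of the original code; applyToKeyedState invokes the function once per key that has state registered under the descriptor:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.util.Collector;

// Inside the KeyedBroadcastProcessFunction (keyed by userId):
@Override
public void processBroadcastElement(ClearCommand cmd, Context ctx, Collector<Output> out)
        throws Exception {
    // Visits every user's MapState and clears it
    ctx.applyToKeyedState(USER_MAP_DESCRIPTOR, (userId, mapState) -> mapState.clear());
}
```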
I am trying to understand the difference between raw and managed state. From the docs:
Keyed State and Operator State exist in two forms: managed and raw.
Managed State is represented in data structures controlled by the
Flink runtime, such as internal hash tables, or RocksDB. Examples are
“ValueState”, “ListState”, etc. Flink’s runtime encodes the states and
writes them into the checkpoints.
Raw State is state that operators keep in their own data structures.
When checkpointed, they only write a sequence of bytes into the
checkpoint. Flink knows nothing about the state’s data structures and
sees only the raw bytes.
However, I have not found any example highlighting the difference. Can anyone provide a minimal example to make the difference clear in code?
Operator state is only used in the Operator API, which is intended only for power users and is not as stable as the end-user APIs; that's why we rarely advertise it.
As an example, consider AbstractUdfStreamOperator, which represents an operator with a UDF. For checkpointing, the state of the UDF needs to be saved, and on recovery restored.
@Override
public void snapshotState(StateSnapshotContext context) throws Exception {
super.snapshotState(context);
StreamingFunctionUtils.snapshotFunctionState(context, getOperatorStateBackend(), userFunction);
}
@Override
public void initializeState(StateInitializationContext context) throws Exception {
super.initializeState(context);
StreamingFunctionUtils.restoreFunctionState(context, userFunction);
}
At this point, the state could be serialized as just a byte blob. As long as the operator can restore the state by itself, the state can take an arbitrary shape.
However, over time much of this operator state has coincidentally also been (re-)implemented as managed state, so in reality the line is blurrier.
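For contrast, the managed-state side is what you get with the CheckpointedFunction pattern. A simplified sketch along the lines of the BufferingSink example from the Flink docs: the operator keeps its working data in its own List (that part Flink knows nothing about), but checkpoints it through a managed ListState, whose element type Flink knows and serializes itself:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private transient ListState<String> checkpointedState; // managed: Flink knows the type
    private final List<String> buffer = new ArrayList<>(); // the operator's own structure

    @Override
    public void invoke(String value, Context context) {
        buffer.add(value);
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.clear();
        checkpointedState.addAll(buffer); // Flink serializes each element for the checkpoint
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<String> descriptor =
            new ListStateDescriptor<>("buffered-elements", String.class);
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);

        for (String element : checkpointedState.get()) {
            buffer.add(element); // restore into the operator's own structure
        }
    }
}
```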
I am trying to understand the differences between the various states that can be used in a ProcessWindowFunction.
First, ProcessWindowFunction is an AbstractRichFunction
abstract class ProcessWindowFunction[IN, OUT, KEY, W <: Window]
extends AbstractRichFunction {...}
As such it can use the method
public RuntimeContext getRuntimeContext()
to get a state
getRuntimeContext().getState
Moreover, the process method of ProcessWindowFunction
def process(key: KEY, context: Context, elements: Iterable[IN], out:
Collector[OUT]) {}
has a context from which two more methods allow me to get state:
/**
* State accessor for per-key and per-window state.
*/
def windowState: KeyedStateStore
/**
* State accessor for per-key global state.
*/
def globalState: KeyedStateStore
Here my questions:
1) How are these related to getRuntimeContext().getState?
2) I often use a custom Trigger implementation and a GlobalWindow. In this case the state is retrieved with getPartitionedState. Can I access a window state defined in the ProcessWindowFunction also in the trigger function? If so, how?
3) There is no open method in the Trigger class to override, how is the state creation handled? Is it safe to just call getPartitionedState, which also manages state creation?
1) getRuntimeContext().getState calls are equivalent to globalState of a ProcessWindowFunction.Context. Both are "global" states, as opposed to the "window" state of windowState. "Global" means the state is shared across all of the windows having the same key; windowState is separate for each window, even for the same key. Keep in mind that even "global" state is NOT shared across different keys.
2) TriggerContext#getPartitionedState() is scoped to both the current key and the current window, so it corresponds to ProcessWindowFunction.Context#windowState() rather than globalState().
3) Based on the code and one example I found (org.apache.flink.table.runtime.triggers.StateCleaningCountTrigger): yes, getPartitionedState() handles creation of the state if it wasn't created before.
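A sketch of the usual pattern for a count trigger on a GlobalWindow: the descriptor is a field, and the state itself is fetched lazily via getPartitionedState on each invocation (no open method is needed). The threshold of 100 is an arbitrary example:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.state.ReducingState;
import org.apache.flink.api.common.state.ReducingStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

public class CountTrigger extends Trigger<Object, GlobalWindow> {

    // The descriptor is the handle; the state is created on first access
    private final ReducingStateDescriptor<Long> countDesc =
        new ReducingStateDescriptor<>("count", new Sum(), LongSerializer.INSTANCE);

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   GlobalWindow window, TriggerContext ctx) throws Exception {
        ReducingState<Long> count = ctx.getPartitionedState(countDesc);
        count.add(1L);
        if (count.get() >= 100) {
            count.clear();
            return TriggerResult.FIRE;
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
        ctx.getPartitionedState(countDesc).clear();
    }

    private static class Sum implements ReduceFunction<Long> {
        @Override
        public Long reduce(Long a, Long b) {
            return a + b;
        }
    }
}
```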
I am using operator state with CheckpointedFunction; however, I encountered a NullPointerException while initializing a MapState:
public void initializeState(FunctionInitializationContext context) throws Exception {
MapStateDescriptor<Long, Long> descriptor
= new MapStateDescriptor<>(
"state",
TypeInformation.of(new TypeHint<Long>() {}),
TypeInformation.of(new TypeHint<Long>() {})
);
state = context.getKeyedStateStore().getMapState(descriptor);
}
I get the NullPointerException when I pass "descriptor" to getMapState().
Here is the stacktrace:
java.lang.NullPointerException
at fyp.Buffer.initializeState(Iteration.java:51)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
I guess you're hitting an NPE because you're attempting to access the KeyedStateStore (documented here); but since you don't have a keyed stream, no such state store is available in your job.
Gets a handle to the system's key/value state. The key/value state is only accessible if the function is executed on a KeyedStream. On each access, the state exposes the value for the key of the element currently processed by the function. Each function may have multiple partitioned states, addressed with different names.
So if you implement CheckpointedFunction (documented here) on a non-keyed stream, you should access the operator state store instead:
snapshotMetadata = context.getOperatorStateStore().getUnionListState(descriptor);
Operator state gives you one state per parallel instance of your job, whereas with keyed state each state instance is tied to a key produced by a keyed stream.
Note that in the example above we request getUnionListState, which on restore returns the state of all the parallel instances of your operator (as a list of states).
If you are looking for a concrete example, you can take a look at this source: it is an operator that implements operator state.
Finally, if you actually need per-key state, you should key the stream and move your solution to Flink's keyed state instead.
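A sketch of the fix under the assumption that the stream really is non-keyed: the operator state store only offers list-style state (getListState / getUnionListState) plus broadcast state, so there is no getMapState there; a ListState replaces the MapState from the question ("state" and the Long element type are taken from the question):

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.runtime.state.FunctionInitializationContext;

// Field in the CheckpointedFunction implementation:
private transient ListState<Long> state;

@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    ListStateDescriptor<Long> descriptor =
        new ListStateDescriptor<>(
            "state",
            TypeInformation.of(new TypeHint<Long>() {}));
    // Operator state store instead of the (null) keyed state store
    state = context.getOperatorStateStore().getListState(descriptor);
}
```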
I'm new to Apache Flink (1 day :) ) and have seen in a few guides that it saves state.
by documentation, you can use:
MemoryStateBackend
FsStateBackend
RocksDBStateBackend
nevertheless I couldn't find sample code for reading/writing to these state backends.
Does that mean that it is for Flink's internal usage, or I can use as well?
meaning: can I store the last day's aggregations, restart Flink, and then read them back again? (like you would with Redis, for example)
Flink's state backends are used for storing the current state of your operator.
There are examples and detailed explanation available here if you haven't seen already.
Essentially, the state is defined in the public void open(Configuration config) method, and then in the flatMap function you can access the state by calling mystate.value(); it can also be updated using mystate.update(newValue).
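A minimal sketch of that open/flatMap pattern: a deduplicator that keeps one Boolean of ValueState per key and only emits the first record it sees for each key. It assumes a stream of Strings keyed by the value itself (keyBy(x -> x)):

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class Deduplicator extends RichFlatMapFunction<String, String> {

    private transient ValueState<Boolean> seen; // one Boolean per key

    @Override
    public void open(Configuration config) {
        seen = getRuntimeContext().getState(
            new ValueStateDescriptor<>("seen", Boolean.class));
    }

    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        if (seen.value() == null) { // first time we see this key
            seen.update(true);
            out.collect(value);
        }
    }
}
```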
Currently this is what you can do with state, but there is a new feature called Queryable State, which is in progress (FLINK-3779) and enables you to query Flink's state from outside Flink.
PS: I am not aware of how Redis handles state.