Why do I get a NullPointerException when using initializeState() in Apache Flink?

I am using operator state with CheckpointedFunction, but I get a NullPointerException while initializing a MapState:
public void initializeState(FunctionInitializationContext context) throws Exception {
    MapStateDescriptor<Long, Long> descriptor =
        new MapStateDescriptor<>(
            "state",
            TypeInformation.of(new TypeHint<Long>() {}),
            TypeInformation.of(new TypeHint<Long>() {})
        );
    state = context.getKeyedStateStore().getMapState(descriptor);
}
The NullPointerException is thrown when I pass "descriptor" to getMapState().
Here is the stacktrace:
java.lang.NullPointerException
at fyp.Buffer.initializeState(Iteration.java:51)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.tryRestoreFunction(StreamingFunctionUtils.java:178)
at org.apache.flink.streaming.util.functions.StreamingFunctionUtils.restoreFunctionState(StreamingFunctionUtils.java:160)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.initializeState(AbstractUdfStreamOperator.java:96)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:259)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeOperators(StreamTask.java:694)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:682)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:253)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)

You are most likely hitting the NPE because you are trying to access the KeyedStateStore (documented here). Since you don't have a keyed stream, no such state store is available in your job.
Gets a handle to the system's key/value state. The key/value state is only accessible if the function is executed on a KeyedStream. On each access, the state exposes the value for the key of the element currently processed by the function. Each function may have multiple partitioned states, addressed with different names.
So if you implement CheckpointedFunction (documented here) on a non-keyed stream (and you want to keep it that way), you should access the operator state store instead:
snapshotMetadata = context.getOperatorStateStore().getUnionListState(descriptor);
Operator state gives you one state per parallel instance of your operator, in contrast to keyed state, where each state instance is tied to a key produced by a keyed stream.
Note that the example above requests getUnionListState, which on restore returns the state of all parallel instances of your operator (as a list of states).
If you are looking for a concrete example, have a look at this source: it is an operator that uses operator state.
Finally, if you do need a keyed stream, consider keying the stream and using Flink's keyed state instead.

Related

How to delete state by name for all keys

I had the following state:
public static final ValueStateDescriptor<String> MY_STATE_DESCRIPTOR =
    new ValueStateDescriptor<>("myState", String.class);
static {
    MY_STATE_DESCRIPTOR.setQueryable("QueryableMyState");
}
protected transient ValueState<String> myState;
@Override
public void open(Configuration parameters) {
    myState = getRuntimeContext().getState(MY_STATE_DESCRIPTOR);
}
in my KeyedCoProcessFunction implementation. I don't need it anymore, but I cannot find out how to delete all entries from "myState" when I don't know all the keys in that state.
I assume you have other state in this application that you don't want to lose.
A few options:
(1) Use the State Processor API to modify a savepoint. Only carry over the state you want to keep. Or use the State Processor API to dump out a list of all of the keys for which there is state, and then use that knowledge to clear it. See ReadRidesAndFaresSnapshot.java for an example showing how to use this API with state snapshots taken from this application.
(2) Temporarily turn the KeyedCoProcessFunction into a KeyedBroadcastProcessFunction with the same UID, and use the applyToKeyedState method to loop over all the keys and clear the state. (This is a somewhat hacky solution which I'm including just for fun.)
(3) Throw away all of your state and start over.
Can state TTL achieve the same effect? A time-to-live (TTL) can be assigned to the keyed state of any type. If a TTL is configured and a state value has expired, the stored value will be cleaned up on a best effort basis which is discussed in more detail below.

Flink checkpointing state for non-keyed stream

I am new to Flink. I am trying to enable checkpointing and use state in my application. I saw in the Flink documentation how to store keyed state, but I am wondering whether we can also store non-keyed state (state for a ProcessFunction).
It's somewhat unusual to need non-keyed state, but there is documentation with examples.
In user code this is generally only needed for implementing custom sources and sinks, which is why the examples focus on those use cases. But in a ProcessFunction you would do the same, i.e., implement the CheckpointedFunction interface (i.e., the initializeState and snapshotState methods).
The only types of non-keyed state are ListState, UnionState, and BroadcastState, and ListState is probably the type you want to use. UnionState is very similar to ListState, it just uses a different strategy for redistributing state during rescaling (each parallel instance gets the entire list, instead of being assigned a slice of the list, and the instances are responsible for knowing what to do). BroadcastState is what's used by a BroadcastProcessFunction or KeyedBroadcastProcessFunction.
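To make the difference between the two redistribution strategies concrete, here is a minimal plain-Java simulation. It is not Flink code and does not claim to reproduce Flink's actual assignment algorithm: the class RedistributionSketch, its method names, and the round-robin assignment are assumptions made purely for illustration. It only shows the shape of the result: disjoint slices for ListState-style restore, the full list everywhere for UnionState-style restore.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation, not Flink code: shows how a checkpointed list could be
// handed out to subtasks on restore under the two redistribution strategies.
class RedistributionSketch {

    // Even-split (ListState-style): each subtask receives a disjoint slice.
    // Round-robin assignment is an assumption made for illustration only.
    static List<List<Integer>> evenSplit(List<Integer> checkpointed, int parallelism) {
        List<List<Integer>> perSubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perSubtask.add(new ArrayList<>());
        }
        for (int i = 0; i < checkpointed.size(); i++) {
            perSubtask.get(i % parallelism).add(checkpointed.get(i));
        }
        return perSubtask;
    }

    // Union (UnionState-style): every subtask receives the complete list and
    // has to decide for itself which entries it is responsible for.
    static List<List<Integer>> union(List<Integer> checkpointed, int parallelism) {
        List<List<Integer>> perSubtask = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perSubtask.add(new ArrayList<>(checkpointed));
        }
        return perSubtask;
    }

    public static void main(String[] args) {
        List<Integer> checkpointed = Arrays.asList(1, 2, 3, 4, 5);
        System.out.println("even-split: " + evenSplit(checkpointed, 2));
        System.out.println("union:      " + union(checkpointed, 2));
    }
}
```

With union redistribution the amount of state each instance has to read on restore grows with the total list size, which is one reason ListState is usually the better default.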

How many instances of Flink functions are created?

Assuming the following pipeline:
input.filter(new RichFilterFunction<MyPojo>() {
    @Override
    public boolean filter(MyPojo value) throws Exception {
        return false;
    }
});
How many instances of the above rich function will be created?
Per task with no exceptions
Per task, however all parallel tasks on a particular node share one instance, since they are part of one JVM instance
There will always be as many instances as the parallelism indicates. There are two reasons related to state for that:
If your function maintains a state, especially in a keyed context, a shared instance would cause unintended side effects.
In the early days, users liked to maintain their own state (e.g., remembering the previous value) in fields of the function. Even though this is heavily discouraged, it would still be bad if Flink could not support it.
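As a rough illustration of why field-level state does not leak between parallel instances, the following plain-Java sketch serializes a function object once and deserializes one copy per simulated subtask, which is my understanding of how the function reaches each parallel task. It has no Flink dependency; CountingFilter and createSubtaskInstances are hypothetical names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Plain-Java sketch, no Flink dependency. CountingFilter is a hypothetical
// stand-in for a user function that keeps its own field-level state.
class InstancePerSubtaskSketch {

    static class CountingFilter implements Serializable {
        private static final long serialVersionUID = 1L;
        int seen = 0;

        boolean filter(String value) {
            seen++;
            return !value.isEmpty();
        }
    }

    // Simulates "parallelism" subtasks, each deserializing its own copy of the
    // function that was serialized once on the client side.
    static CountingFilter[] createSubtaskInstances(int parallelism) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(new CountingFilter());
            }
            byte[] shipped = bytes.toByteArray();

            CountingFilter[] instances = new CountingFilter[parallelism];
            for (int i = 0; i < parallelism; i++) {
                try (ObjectInputStream in =
                        new ObjectInputStream(new ByteArrayInputStream(shipped))) {
                    instances[i] = (CountingFilter) in.readObject();
                }
            }
            return instances;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CountingFilter[] subtasks = createSubtaskInstances(2);
        subtasks[0].filter("a");
        subtasks[0].filter("b");
        System.out.println("distinct objects: " + (subtasks[0] != subtasks[1]));
        System.out.println("seen per subtask: " + subtasks[0].seen + ", " + subtasks[1].seen);
    }
}
```

Each copy counts only the elements it processed itself, which is the behaviour a shared instance would break.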

Example of raw vs managed state

I am trying to understand the difference between raw and managed state. From the docs:
Keyed State and Operator State exist in two forms: managed and raw.
Managed State is represented in data structures controlled by the
Flink runtime, such as internal hash tables, or RocksDB. Examples are
“ValueState”, “ListState”, etc. Flink’s runtime encodes the states and
writes them into the checkpoints.
Raw State is state that operators keep in their own data structures.
When checkpointed, they only write a sequence of bytes into the
checkpoint. Flink knows nothing about the state’s data structures and
sees only the raw bytes.
However, I have not found any example highlighting the difference. Can anyone provide a minimal example to make the difference clear in code?
Operator state is only used in the Operator API, which is intended only for power users and is not as stable as the end-user APIs, which is why we rarely advertise it.
As an example, consider AbstractUdfStreamOperator, which represents an operator with a UDF. For checkpointing, the state of the UDF needs to be saved, and on recovery it needs to be restored.
@Override
public void snapshotState(StateSnapshotContext context) throws Exception {
    super.snapshotState(context);
    StreamingFunctionUtils.snapshotFunctionState(context, getOperatorStateBackend(), userFunction);
}
@Override
public void initializeState(StateInitializationContext context) throws Exception {
    super.initializeState(context);
    StreamingFunctionUtils.restoreFunctionState(context, userFunction);
}
At this point, the state could be serialized as just a byte blob. As long as the operator can restore the state by itself, the state can take an arbitrary shape.
However, over time, much of the operator state has also been (re-)implemented as managed state, so the line is blurrier in reality.
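To illustrate the "just a byte blob" idea, here is a minimal plain-Java sketch with no Flink dependency. RawCounterState is a hypothetical operator-owned data structure; the byte layout (entry count followed by key/value pairs) is an arbitrary choice for this example. The point is only that whoever stores the blob sees opaque bytes, while the operator alone knows how to write and read them. With managed state such as ValueState, that encoding would be handled by the Flink runtime instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the "raw state" idea, no Flink dependency.
// RawCounterState is a hypothetical structure owned entirely by the operator.
class RawCounterState {
    private final Map<String, Long> counts = new TreeMap<>();

    void increment(String key) {
        counts.merge(key, 1L, Long::sum);
    }

    long get(String key) {
        return counts.getOrDefault(key, 0L);
    }

    // "Snapshot": the operator picks its own byte layout
    // (entry count, then one UTF key and one long value per entry).
    byte[] snapshot() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(counts.size());
            for (Map.Entry<String, Long> entry : counts.entrySet()) {
                out.writeUTF(entry.getKey());
                out.writeLong(entry.getValue());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // "Restore": only this class knows how to interpret the blob.
    static RawCounterState restore(byte[] blob) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(blob));
            RawCounterState state = new RawCounterState();
            int entries = in.readInt();
            for (int i = 0; i < entries; i++) {
                String key = in.readUTF();
                long value = in.readLong();
                state.counts.put(key, value);
            }
            return state;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        RawCounterState state = new RawCounterState();
        state.increment("a");
        state.increment("a");
        byte[] blob = state.snapshot();
        System.out.println("blob size in bytes: " + blob.length);
        System.out.println("restored count for a: " + restore(blob).get("a"));
    }
}
```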

When to use transient, and when not to, in Flink?

In this code, should I use transient?
When can I use transient?
What is the difference?
I need your help.
private Map<String, HermesCustomConsumer> topicSourceMap = new ConcurrentHashMap<>();
private Map<TopicAndPartition, Long> currentOffsets = new HashMap<>();
private transient Map<TopicAndPartition, Long> restoredState;
TL;DR
If you use a transient variable, you should instantiate it in the open() method of an operator that implements a Rich interface. Otherwise, declare the variable together with its initial value.
The state you use here is so-called raw state, i.e. state managed by the user. Whether you should use the transient modifier depends on whether the field needs to be serialized. When you submit the Flink job, the computation topology is generated and distributed to the Flink cluster, and the operators, including sources and sinks, are serialized together with their fields, e.g. topicSourceMap in your code. The variables topicSourceMap and currentOffsets are instantiated when the object is constructed and are shipped with it. restoredState, on the other hand, is transient, so no matter what initial value you assign to it, it will not be serialized and distributed to the tasks. That is why you usually need to instantiate it in the open() method of an operator that implements a Rich interface: after the operator is deserialized in a task, open() is invoked and can initialize your own state.
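The effect described above can be reproduced with plain Java serialization, without any Flink dependency. In the sketch below, SourceLike is a hypothetical stand-in for the operator in the question, and its open() method only mimics the role of the open() method of a Rich function; the field names are borrowed from the question, the rest is assumed for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch, no Flink dependency. SourceLike is a hypothetical
// stand-in for an operator that gets serialized and shipped to a task.
class TransientSketch {

    static class SourceLike implements Serializable {
        private static final long serialVersionUID = 1L;

        // Non-transient: serialized together with the object.
        private final Map<String, Long> currentOffsets = new HashMap<>();
        // Transient: skipped during serialization, so this initial value is
        // lost and the field is null after deserialization.
        private transient Map<String, Long> restoredState = new HashMap<>();

        void putOffset(String partition, long offset) {
            currentOffsets.put(partition, offset);
        }

        int offsetCount() {
            return currentOffsets.size();
        }

        boolean restoredStateIsNull() {
            return restoredState == null;
        }

        // Mimics open(): re-create transient fields after deserialization.
        void open() {
            restoredState = new HashMap<>();
        }
    }

    // Serializes and deserializes the object, like shipping it to a task.
    static SourceLike roundTrip(SourceLike original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (SourceLike) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SourceLike source = new SourceLike();
        source.putOffset("topic-0", 42L);
        SourceLike shipped = roundTrip(source);
        System.out.println("offsets survived: " + shipped.offsetCount());
        System.out.println("transient field is null: " + shipped.restoredStateIsNull());
        shipped.open();
        System.out.println("after open(), null: " + shipped.restoredStateIsNull());
    }
}
```

Note that the field initializer of the transient field does not run again on deserialization, which is why the re-initialization has to happen explicitly.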

Resources