We have a job in which all user features and information are stored in keyed state. Each user feature has its own state descriptor. Our features evolve, so sometimes a feature is abandoned in the next release/version and we no longer declare that feature's state descriptor in our code. My question is: how does Flink take care of that abandoned state? Will it simply no longer restore it automatically?
If you are using Flink POJOs or Avro types, then Flink will automatically migrate the types and state for you. Otherwise, it will not, and you could implement a custom serializer instead. Or you could use the State Processor API to clean things up.
Environment:
flink 1.7.1
kafka 1.0.1
I am using a Flink application for stream processing.
It reads a topic from Kafka and sinks it to a new Kafka topic.
When I change the application code and deploy a new version, the application fails on execution.
If I deploy with the same group.id after changing the application code, could there be a conflict with the previous checkpoint state information?
Yes, if you are trying to do a stateful upgrade of your Flink application, there are a few things that can cause it to fail.
The UIDs of the stateful operators are used to find the state for each operator. If you haven't set the UIDs, then if the job's topology has changed, state restore will fail because Flink won't be able to find the state. See the docs on Assigning Operator IDs for details.
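For example, a minimal sketch of assigning UIDs; the topic names, UIDs, and the stateful map here are placeholders for illustration only:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

import java.util.Properties;

public class UidExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092");
        kafkaProps.setProperty("group.id", "my-group");

        env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), kafkaProps))
           .uid("kafka-source")           // stable ID so the source state (offsets) is found on restore
           .keyBy(value -> value)
           .map(value -> value.toUpperCase())  // stand-in for your stateful logic
           .uid("uppercase-map")          // stable ID for this operator's state
           .addSink(new FlinkKafkaProducer<>("output-topic", new SimpleStringSchema(), kafkaProps))
           .uid("kafka-sink");

        env.execute("uid example");
    }
}
```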
If you have dropped a stateful operator, then you should run the new job while specifying -allowNonRestoredState.
If you have modified your data types, the job can fail when attempting to deserialize the state in the checkpoint or savepoint. Flink 1.7 had only very limited support for automatic schema evolution and state migration (Avro types only). In more recent versions of Flink, if you stick to POJOs or Avro, this is handled automatically. Otherwise you need custom serializers.
If this doesn't help you figure out what's going wrong, please share the information from the logs showing the specific exception.
We are deploying a new Flink stream processing job, and its state (stores) needs to be initialized with historical data; this data should be available in the state store before the job starts processing any new application events. We don't want to significantly modify the Flink job just to load the historical data.
We considered writing another, separate Flink job to process the historical data, update its state store, and create a savepoint, then use this savepoint to initialize the state in the main Flink job. It looks like the State Processor API only works with the DataSet API, so I'm wondering about any alternative solutions. Thanks.
The State Processor API is a good solution. It provides a sort of savepoint connector that you use in a DataSet job to read/modify/update the savepoints that you use in your DataStream jobs.
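As an illustration, here is a minimal sketch of bootstrapping keyed state into a new savepoint with the DataSet-based State Processor API; the operator UID ("my-stateful-operator-uid"), the state name ("user-value"), and the types are placeholders that must match what your streaming job declares:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class BootstrapHistoricalState {

    // Writes each historical (userId, value) pair into a keyed ValueState named "user-value".
    static class HistoryBootstrapper extends KeyedStateBootstrapFunction<String, Tuple2<String, Long>> {
        private transient ValueState<Long> state;

        @Override
        public void open(Configuration parameters) {
            state = getRuntimeContext().getState(new ValueStateDescriptor<>("user-value", Long.class));
        }

        @Override
        public void processElement(Tuple2<String, Long> value, Context ctx) throws Exception {
            state.update(value.f1);
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Historical data; in practice this would come from files, a database dump, etc.
        DataSet<Tuple2<String, Long>> history = env.fromElements(
                Tuple2.of("user-1", 42L),
                Tuple2.of("user-2", 7L));

        BootstrapTransformation<Tuple2<String, Long>> transformation = OperatorTransformation
                .bootstrapWith(history)
                .keyBy(new KeySelector<Tuple2<String, Long>, String>() {
                    @Override
                    public String getKey(Tuple2<String, Long> value) {
                        return value.f0;
                    }
                })
                .transform(new HistoryBootstrapper());

        Savepoint
                .create(new MemoryStateBackend(), 128)  // 128 = max parallelism of the streaming job
                .withOperator("my-stateful-operator-uid", transformation)
                .write("file:///tmp/bootstrap-savepoint");

        env.execute("bootstrap historical state");
    }
}
```

The streaming job can then be started with `--fromSavepoint`, pointing at the written savepoint, as long as the UID on its stateful operator matches the one used in `withOperator`.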
It's a pretty simple change (definitely not "significant") to support a -preload mode for your job, where the non-historical data sources get replaced by empty/non-terminating sources. I typically use counters to decide when state has been fully populated, then stop with a savepoint, and restart without the -preload option.
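A rough sketch of that approach (the --preload flag and the empty source are hypothetical, not part of any Flink API):

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class PreloadableJob {

    // Emits nothing and never terminates, so the job stays alive while state is being
    // preloaded from the historical source and can then be stopped with a savepoint.
    static class EmptyNonTerminatingSource implements SourceFunction<String> {
        private volatile boolean running = true;

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            while (running) {
                Thread.sleep(1000L);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }

    public static void main(String[] args) throws Exception {
        boolean preload = ParameterTool.fromArgs(args).getBoolean("preload", false);

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092");
        kafkaProps.setProperty("group.id", "my-job");

        // In preload mode the live source is swapped for an empty, non-terminating one,
        // so only the historical source populates state.
        DataStream<String> liveEvents = preload
                ? env.addSource(new EmptyNonTerminatingSource())
                : env.addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), kafkaProps));

        // ... connect liveEvents with the historical source and the stateful operators here ...
        liveEvents.print();

        env.execute(preload ? "my-job (preload)" : "my-job");
    }
}
```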
The external database consists of a set of rules for each key, these rules should be applied on each stream element in the Flink job. Because it is very expensive to make a DB call for each element and retrieve the rules, I want to fetch the rules from the database at initialization and store it in a local cache.
When rules are updated in the external database, a status change event is published to the Flink job which should be used to fetch the rules and refresh this cache.
What is the best way to achieve what I've described? I looked into keyed state but initializing all keys and refreshing the keys on update doesn't seem possible.
I think you can make use of BroadcastProcessFunction or KeyedBroadcastProcessFunction to achieve your use case. A detailed blog is available here.
In short: you can define a source (Kafka or any other) for the rules stream and publish rule updates to it, then connect the actual data stream with the rules stream. processBroadcastElement() will then receive the rules, where you can update the broadcast state. Finally, the updated state (rules) can be read in the actual event processing method, processElement().
Points to consider: broadcast state is always kept on the heap, not in the state backend (RocksDB), so it has to be small enough to fit in memory. Each slot copies all of the broadcast state into its checkpoints, so all checkpoints and savepoints will contain n (parallelism) copies of the broadcast state.
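Here is a minimal sketch of that pattern, assuming a simple string-encoded rules stream; the state name, the Event type, and the rule parsing are placeholders:

```java
import java.util.Map;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class RulesEnrichment {

    // Broadcast state: ruleId -> rule definition (kept as a plain String here for simplicity).
    static final MapStateDescriptor<String, String> RULES_DESCRIPTOR =
            new MapStateDescriptor<>("rules", Types.STRING, Types.STRING);

    // A simple keyed event type; in practice this would be your application event.
    public static class Event {
        public String key;
    }

    static class ApplyRules extends KeyedBroadcastProcessFunction<String, Event, String, String> {

        @Override
        public void processElement(Event event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
            // Read the current rules from broadcast state and apply them to the event.
            for (Map.Entry<String, String> rule : ctx.getBroadcastState(RULES_DESCRIPTOR).immutableEntries()) {
                out.collect("applied rule " + rule.getKey() + " to " + event.key);
            }
        }

        @Override
        public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
            // Update the broadcast state whenever a rule update arrives on the rules stream.
            ctx.getBroadcastState(RULES_DESCRIPTOR).put(ruleId(rule), rule);
        }

        private String ruleId(String rule) {
            return rule.split(":")[0]; // placeholder parsing: "ruleId:definition"
        }
    }

    static DataStream<String> wire(DataStream<Event> events, DataStream<String> rules) {
        BroadcastStream<String> broadcastRules = rules.broadcast(RULES_DESCRIPTOR);
        return events
                .keyBy(e -> e.key)
                .connect(broadcastRules)
                .process(new ApplyRules());
    }
}
```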
A few different mechanisms in Flink may be relevant to this use case, depending on your detailed requirements.
Broadcast State
Jaya Ananthram has already covered the idea of using broadcast state in his answer. This makes sense if the rules should be applied globally, for every key, and if you can find a way to collect and broadcast the updates.
Note that the Context in the processBroadcastElement() method of a KeyedBroadcastProcessFunction contains the method applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function). This means you can register a KeyedStateFunction that will be applied to the state of every key associated with the provided stateDescriptor.
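Continuing the KeyedBroadcastProcessFunction sketch from the previous answer, using that hook might look roughly like this; the "user-value" descriptor stands in for whatever keyed state the function also manages (ValueState, ValueStateDescriptor, and KeyedStateFunction come from org.apache.flink.api.common.state):

```java
@Override
public void processBroadcastElement(String rule, Context ctx, Collector<String> out) throws Exception {
    // Hypothetical keyed state used elsewhere in this function.
    ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("user-value", Long.class);

    // Apply a function to the "user-value" state of every key that has state for this descriptor,
    // e.g. to clear or rewrite it whenever the rules change.
    ctx.applyToKeyedState(descriptor, new KeyedStateFunction<String, ValueState<Long>>() {
        @Override
        public void process(String key, ValueState<Long> state) throws Exception {
            state.clear();
        }
    });
}
```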
State Processor API
If you want to bootstrap state in a Flink savepoint from a database dump, you can do that with this library. You'll find a simple example of using the State Processor API to bootstrap state in this gist.
Change Data Capture
The Table/SQL API supports Debezium, Canal, and Maxwell CDC streams, and Kafka upsert streams. This may be a solution. There's also flink-cdc-connectors.
Lookup Joins
Flink SQL can do temporal lookup joins against a JDBC database, with a configurable cache. Not sure this is relevant.
In essence David's answer summarizes it well. If you are looking for more detail: not long ago, I gave a webinar [1] on this topic including running code examples. [2]
[1] https://www.youtube.com/watch?v=cJS18iKLUIY
[2] https://github.com/knaufk/enrichments-with-flink
I am adding TTL to a ValueState in one ProcessFunction in one of my Flink apps. The app has multiple other kinds of state, both in this ProcessFunction and in other operators. I understand that adding TTL to a ValueState makes it non-backwards compatible. However, I was wondering if I could use the allowNonRestoredState option to restore the rest of the application's state from the snapshot and have Flink just skip restoring the one ValueState I add TTL to. Essentially, I was hoping for a little more insight into what allowNonRestoredState does. From the docs, it seems like it only works in situations where state was dropped altogether, not in cases where the state still exists but has been modified.
allowNonRestoredState simply allows a job to start from a state snapshot (a savepoint or checkpoint) that contains state that has nowhere to be restored to in the job being started. In other words, some state was dropped.
Instead of trying to get Flink to not restore state for a particular ValueState, you could leave the old ValueState alone, while also introducing a new ValueState (with state TTL). When reading the new ValueState, if it's null, you could then migrate forward the old value.
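A sketch of that idea, assuming a keyed process function with a Long-valued state; the state names and TTL settings are placeholders:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LazyTtlMigration extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> oldState;   // existing state, restored from the snapshot
    private transient ValueState<Long> newState;   // new state with TTL enabled

    @Override
    public void open(Configuration parameters) {
        oldState = getRuntimeContext().getState(new ValueStateDescriptor<>("counter", Long.class));

        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.days(7))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();

        ValueStateDescriptor<Long> newDescriptor = new ValueStateDescriptor<>("counter-ttl", Long.class);
        newDescriptor.enableTimeToLive(ttlConfig);
        newState = getRuntimeContext().getState(newDescriptor);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long current = newState.value();
        if (current == null && oldState.value() != null) {
            // Lazily migrate the value from the old (non-TTL) state the first time this key is seen.
            current = oldState.value();
            oldState.clear();
        }
        current = (current == null ? 0L : current) + 1;
        newState.update(current);
        out.collect(ctx.getCurrentKey() + " -> " + current);
    }
}
```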
However, I think it would be preferable to do a complete, one-time migration using the State Processor API (as I proposed here).
I am using Apache Flink 1.9 and the standard checkpoint/savepoint mechanism to a filesystem.
My question is: what is the proper way to restore a job from a savepoint if the job's code has changed?
For example, after refactoring I renamed a few classes, and after that I can't restore from the old checkpoint.
I lose my data, and want to ask: what can I do in these cases?
All operators have a uid and a name.
Short answer: it depends.
As for the more elaborate explanation: it generally shouldn't be an issue if you have only reordered and renamed the classes, as long as the UIDs have not changed. Refactoring, however, may actually change how the state is stored and thus prevent it from being restored. In such cases you can use the parameter --allowNonRestoredState, which should allow the available states to be restored from the savepoint while the rest start empty. Keep in mind that this may not restore all of the states. In general, you shouldn't really refactor operators once they are running, since that can effectively prevent restoring from a savepoint.
It's worth noting that it may not be possible to restore from a savepoint if you are using SQL; refer to the FLINK-6966 issue.
I assume that you are dealing with savepoints, not externalized checkpoints; otherwise there are a few more things to keep in mind, especially when changing parallelism.
It seems your state type cannot be treated as a POJO (POJOs: classes that follow a certain bean-like pattern). When a user-defined data type can't be recognized as a POJO type, it must be processed as a GenericType and serialized with Kryo.
Currently, Flink supports schema evolution only for POJO and Avro types. Therefore, if you care about schema evolution for your state, it is currently recommended to always use either POJO or Avro types for state data types.
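For example, a state type along these lines would be recognized as a Flink POJO (the class and field names are just illustrative):

```java
// Recognized as a Flink POJO: public class, public no-argument constructor,
// and all fields either public or accessible via getters/setters.
public class UserFeature {
    public String userId;
    public long featureValue;

    public UserFeature() {}

    public UserFeature(String userId, long featureValue) {
        this.userId = userId;
        this.featureValue = featureValue;
    }
}
```

Fields can then be added or removed in a later version of the class, and Flink's POJO serializer can migrate the state on restore, within the limitations described in the schema evolution docs linked below.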
Some docs FYI:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/types_serialization.html
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/schema_evolution.html