env
flink 1.7.1
kafka 1.0.1
I use Flink application in Streaming process.
Read topic from kafka and sink it to kafka new Topic.
When i change application with new version of code and deploy, it comes to application execution failure.
If i deploy the same group.id after changing the application code, could there be a conflict with previous state checkpoint information?
Yes, if you are trying to do a stateful upgrade of your Flink application, there are a few things that can cause it to fail.
The UIDs of the stateful operators are used to find the state for each operator. If you haven't set the UIDs, then if the job's topology has changed, state restore will fail because Flink won't be able to find the state. See the docs on Assigning Operator IDs for details.
If you have dropped a stateful operator, then you should run the new job while specifying -allowNonRestoredState.
If you have modified your data types, the job can fail when attempting to deserialize the state in the checkpoint or savepoint. Flink 1.7 did not have any support for automatic schema evolution or state migration. In more recent versions of Flink, if you stick to POJOs or Avro, this is handled automatically. Otherwise you need custom serializers.
If this doesn't help you figure out what's going wrong, please share the information from the logs showing the specific exception.
Related
I am interested in processing large state using Flink.
To resolve this issue, there are some ways to handle it such as incremental checkpoint and others.
I understand its concept via the Flink document.
And also I found that there is change log statebackend which is introduced in Flink 1.16.
I think that chanage log state backend can reduce the time of taking snapshot by capturing the difference between current and previous one.
But I am a little bit confused and cannot fully understand the difference between incremental checkpoint and the change log state backend.
I want to use two methods to process large state in Flink, but want to understand its nature of origin. Although I read the articles provided in the Flink document, but not fully differentiate between the incremental checkpoint and the changelog state backend.
Is it almost same except that incremental checkpoint is focused on checkpoint mechanism while the change log state back end is focused on snapshot?
Any comments will be appreciated.
I faced with unexpected behavior when need start job from checkpoint and change Kafka topic. In this case Flink restore state for Kafka Consumer with early defined topic, last committed offset and consumer group id, as a result, Kafka Consumer starts consuming messages from two topics, the former one, which was restored from the state and the new one, defined in the configuration at the start of the job.
It's very confusing, and in the end, it's not entirely clear if it's a bug or a feature? Is there a way to manage recovery jobs from a checkpoint and at the same time not restore the state of Kafka consumers, but instead use the parameters from the configuration to initialize them?
I need a previous job state, but I want to get new data from another topic!
If you change the UID of the KafkaSource (or FlinkKafkaConsumer) and restart the job with allowNonRestoredState enabled, then you'll get the behavior you are looking for.
Changing the UID (or setting one, if you haven't explicitly set one) will prevent the saved Kafka offsets from being restored, and allowNonRestoredState will override Flink's built-in protections against losing state.
I'm using MQTT consumer as my flink job's data source. I'm wondering how to save the data offsets into checkpoint to ensure that no data lost when flink cluster restarts after a failure. I've see lots of articles introducing how apache flink manages kafka consumer offsets. Does anyone know whether apache flink has its own function to manage MQTT consumer? Thanks.
If you have a MQTT consumer, you should make sure it uses the Data Source API. You can read about that on https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/sources/ - That also includes how to work integrate with checkpointing. You can also read the details in FLIP-27 https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
You shoul read state-backends part of documentation. And checkpoints section.
When checkpointing is enabled, managed state is persisted to ensure
consistent recovery in case of failures. Where the state is persisted
during checkpointing depends on the chosen Checkpoint Storage.
We have a job which all the user feature and information are stored in keyed state. Each user feature represents a state descriptor. But we are evolving our features so sometimes some features are abandoned in our next release/version because we will no longer declare the abandoned feature state's descriptor in our code. My question is how flink takes care of those abandoned state? Will it no longer restore those abandoned state automatically?
If you are using Flink POJOs or Avro types, then Flink will automatically migrate the types and state for you. Otherwise, it will not, and you could implement a custom serializer instead. Or you could use the State Processor API to clean things up.
Can someone please help me understand the difference between Apache Flink's Checkpoints & Savepoints.
While i read the documentation, couldn't understand the difference! :s
Apache Flink's Checkpoints and Savepoints are similar in that way they both are mechanisms for preserving internal state of Flink's applications.
Checkpoints are taken automatically and are used for automatic restarting job in case of a failure.
Savepoints on the other hand are taken manually, are always stored externally and are used for starting a "new" job with previous internal state in case of e.g.
bug fixing
flink version upgrade
A/B testing, etc.
Underneath they are in fact the same mechanism/code path with some subtle nuances.
Edit:
You can also find a very good explanation in the official documentation https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#what-is-a-savepoint-how-is-a-savepoint-different-from-a-checkpoint :
A Savepoint is a consistent image of the execution state of a streaming job, created via Flink’s checkpointing mechanism. You can use Savepoints to stop-and-resume, fork, or update your Flink jobs. Savepoints consist of two parts: a directory with (typically large) binary files on stable storage (e.g. HDFS, S3, …) and a (relatively small) meta data file. The files on stable storage represent the net data of the job’s execution state image. The meta data file of a Savepoint contains (primarily) pointers to all files on stable storage that are part of the Savepoint, in form of absolute paths.
Attention: In order to allow upgrades between programs and Flink versions, it is important to check out the following section about assigning IDs to your operators.
Conceptually, Flink’s Savepoints are different from Checkpoints in a similar way that backups are different from recovery logs in traditional database systems. The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink, i.e. a Checkpoint is created, owned, and released by Flink - without user interaction. As a method of recovery and being periodically triggered, two main design goals for the Checkpoint implementation are i) being as lightweight to create and ii) being as fast to restore from as possible. Optimizations towards those goals can exploit certain properties, e.g. that the job code doesn’t change between the execution attempts. Checkpoints are usually dropped after the job was terminated by the user (except if explicitly configured as retained Checkpoints).
In contrast to all this, Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume. For example, this could be an update of your Flink version, changing your job graph, changing parallelism, forking a second job like for a red/blue deployment, and so on. Of course, Savepoints must survive job termination. Conceptually, Savepoints can be a bit more expensive to produce and restore and focus more on portability and support for the previously mentioned changes to the job.
Those conceptual differences aside, the current implementations of Checkpoints and Savepoints are basically using the same code and produce the same format. However, there is currently one exception from this, and we might introduce more differences in the future. The exception are incremental checkpoints with the RocksDB state backend. They are using some RocksDB internal format instead of Flink’s native savepoint format. This makes them the first instance of a more lightweight checkpointing mechanism, compared to Savepoints.
Savepoints
Savepoints usually apply to an individual transaction; it marks a
point to which the transaction can be rolled back, so subsequent
changes can be undone if necessary.
More See Here
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html#savepoints
Checkpoints
Checkpoints usually apply to whole systems, You can configure periodic checkpoints to be persisted externally. Externalized checkpoints write their meta data out to persistent storage and are not automatically cleaned up when the job fails.
More See Here:
https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/checkpoints.html
On difference I would like to add is savepoint can be manually applied when we upgrade the pipeline vs checkpoint kicks in as useful in case the pipeline restarts or crashes abruptly. However, there could be side effects to later where application(pipeline) has to handle any scenarios like re-processing duplicate data etc.