After cleaning old files (files not accessed for more than a month) out of the S3 bucket used to store checkpoints, some processes of the job do not start when the job restarts or is restored from a recent checkpoint, because some of those old files are missing.
The job works well and saves recent checkpoints (save path: s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274)
2022-04-24 03:58:32.892 Triggering checkpoint 100273 # 1653353912890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 03:58:55.317 Completed checkpoint 100273 for job af8b0712ae0c1f20d2226b86e6bddb60 (679053131 bytes in 22090 ms).
2022-04-24 04:03:32.892 Triggering checkpoint 100274 # 1653354212890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 04:03:35.844 Completed checkpoint 100274 for job af8b0712ae0c1f20d2226b86e6bddb60 (9606712 bytes in 2494 ms).
After one TaskManager was switched off, the job restarted:
2022-04-24 04:04:40.936 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:14.150 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RESTARTING to RUNNING.
2022-04-24 04:05:14.198 Restoring job af8b0712ae0c1f20d2226b86e6bddb60 from latest valid checkpoint: Checkpoint 100274 # 1653354212890 for af8b0712ae0c1f20d2226b86e6bddb60.
After some time the job failed because some processes could not restore their state:
2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:17.093 Process first events -> Sink: Sink to test-job (5/10) (4f9089b1015540eb6e13afe4c07fa97b) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_f1d5710fb330fd579d15b292e305802c_(5/10) from any of the 1 provided restore options.
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default), S3 Extended Request ID: 51f03da-default-default (Path: s3://flink-checkpoints/check/e3d82336005fc40be9af536938716199/shared/64452a30-c8a0-454f-8164-34d9e70142e0)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default)
If I completely cancel the job and start a new one with the savepoint path set to the last checkpoint, I get the same errors.
Why, when working with a checkpoint from the af8b0712ae0c1f20d2226b86e6bddb60 folder, does the job try to get some files from the e3d82336005fc40be9af536938716199 folder, and what are the rules for clearing old checkpoints from storage?
UPDATE
I found that Flink saves the S3 paths of all TaskManagers' RocksDB files in the chk-*/_metadata file.
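For anyone who wants to check which job folders a checkpoint actually references, here is a minimal sketch (plain Java, no Flink dependencies; the class name and the assumption that the referenced paths appear as readable s3:// strings inside _metadata, with the bucket/check/<job-id> layout from the paths above, are mine) that scans a chk-*/_metadata file and prints the distinct job-id folders it points at:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MetadataPathScanner {
    public static void main(String[] args) throws Exception {
        // Usage: java MetadataPathScanner /path/to/chk-100274/_metadata
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // ISO-8859-1 maps bytes 1:1, so ASCII substrings in the binary file survive decoding.
        String text = new String(raw, StandardCharsets.ISO_8859_1);

        // Collect every s3:// path embedded in the metadata and reduce each one
        // to its s3://<bucket>/<prefix>/<32-hex-job-id> folder.
        Matcher m = Pattern.compile("s3://[\\x21-\\x7e]+").matcher(text);
        TreeSet<String> jobFolders = new TreeSet<>();
        while (m.find()) {
            jobFolders.add(m.group().replaceAll("(s3://[^/]+/[^/]+/[0-9a-f]{32}).*", "$1"));
        }
        jobFolders.forEach(System.out::println);
    }
}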
This was quite ambiguous for a long time and has recently been addressed in Flink 1.15. I would recommend reading https://flink.apache.org/news/2022/05/05/1.15-announcement.html, in particular the section 'Clarification of checkpoint and savepoint semantics', including the part where checkpoints and savepoints are compared.
The behaviour you've experienced depends on your checkpointing setup (aligned vs unaligned).
By default, cancelling a job removes its checkpoints. There is a configuration option to control this: execution.checkpointing.externalized-checkpoint-retention. As mentioned by Martijn, normally you would use savepoints for controlled job upgrades / restarts.
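For illustration, a minimal sketch of enabling retained (externalized) checkpoints through the DataStream API; the class name, intervals, and pipeline are made up, and depending on your Flink version the older enableExternalizedCheckpoints(...) method plays the same role as setExternalizedCheckpointCleanup(...):

import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 5 minutes.
        env.enableCheckpointing(5 * 60 * 1000);

        // Keep the externalized checkpoint when the job is cancelled, so it can
        // still be used later, e.g. with `flink run -s <checkpoint-path> ...`.
        env.getCheckpointConfig().setExternalizedCheckpointCleanup(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; replace with the real job graph.
        env.fromElements(1, 2, 3).print();
        env.execute("retained-checkpoints-example");
    }
}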
Related
I am trying to use DMS to capture change logs from SQL Server and write them to S3. I have set up a long polling period of 6 hours (AWS recommends > 1 hour). DMS fails with the below error when the database is idle for a few hours during the night.
DMS Error:
Last Error AlwaysOn BACKUP-ed data is not available Task error notification received from subtask 0, thread 0
Error from cloud watch - Failed to access LSN '000033fc:00005314:01e6' in the backup log sets since BACKUP/LOG-s are not available
I am currently using DMS version 3.4.6 with multi-az.
I always thought DMS reads the change data immediately after the DML changes are written to the transaction log. Why do we see this error even with a long polling period? Can someone explain why this issue occurs and how we can handle it?
I'm new to Flink. My Flink job receives messages from MQ, does some rule checking and summary calculation, then writes the results to an RDBMS. Sometimes the job encounters a NullPointerException (due to my silly code) or an MQ connection exception (due to a non-existent topic); it just halts processing of the current message while the job keeps running, and subsequent messages trigger the exception again.
But today I restarted the DB and the job failed. What's the difference?
I am using Apache Flink 1.10 to batch compute my stream data. Today I moved my Apache Flink Kubernetes (v1.15.2) pod from machine 1 to machine 2 and found that all submitted task records and the task list disappeared. What is happening? Is the submit record kept only in memory? What should I do to keep my submit records and task list when the Kubernetes pod of Apache Flink restarts? I only found checkpoint persistence, but nothing about tasks.
If I lose the running task history, I must upload my task jar and recreate all tasks, and there are many tasks to recreate if the history is lost. Is there any way to resume the tasks automatically?
The configurations that might not be set are:
Job Manager
jobmanager.archive.fs.dir: hdfs:///completed-jobs
History Server
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
For more details, please look at: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/historyserver.html#configuration
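As a quick sanity check that archived jobs are being picked up, something like the following (my own sketch; it assumes the HistoryServer is reachable on localhost and its default web port 8082, so adjust host/port to your historyserver.web.address / historyserver.web.port settings) queries its REST endpoint for the list of completed jobs:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class HistoryServerCheck {
    public static void main(String[] args) throws Exception {
        // The HistoryServer serves the same REST API as the dashboard for archived jobs.
        URL url = new URL("http://localhost:8082/jobs/overview");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            // Prints the JSON list of completed jobs the HistoryServer knows about.
            reader.lines().forEach(System.out::println);
        }
    }
}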
What happens when an exception is thrown from the application jar in a Task Manager while processing an event?
a) Will the Flink Job Manager kill the existing task manager and create a new one?
b) Does the task manager itself recover from the failed execution and restart processing using local state saved in RocksDB?
java.lang.IllegalArgumentException: "Application error-stack trace"
My concern is that if the same kind of erroneous event is processed by each of the available task managers, they all get killed and the entire Flink job goes down.
I am noticing that when some application error occurs, the entire job eventually goes down.
I haven't figured out the exact reason yet.
In general, an exception in the job should not cause the whole task manager to go down; we are talking about "normal" exceptions here. In such a case the job itself will fail and will be restarted or not depending on the provided restart strategy.
Obviously, if for some reason your task manager dies, for example due to timeouts or something else, it will not be restarted automatically unless you use a resource manager or orchestration tool like YARN or Kubernetes. In such a case the job should be started again once slots are available.
As for the behaviour you have described, where the job itself is "going down": I assume the job is simply going to the FAILED state. This is because different restart strategies have different thresholds for the maximum number of retries, and if the job does not recover within the specified number of restarts, it simply goes to the FAILED state.
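As a minimal sketch of that last point (class name, values, and pipeline are made up; it assumes the DataStream API and a Flink version where RestartStrategies is available), a fixed-delay restart strategy with a bounded number of attempts looks roughly like this; once the attempts are exhausted, the job goes to FAILED:

import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Restart the job at most 3 times, waiting 10 seconds between attempts.
        // After the 3rd failed restart the job transitions to the FAILED state.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(3, Time.of(10, TimeUnit.SECONDS)));

        // Placeholder pipeline; an exception thrown inside an operator of the real
        // job triggers a restart attempt governed by the strategy above.
        env.fromElements(1, 2, 3).print();
        env.execute("restart-strategy-example");
    }
}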
I have a Flink v1.2 setup with 1 JobManager and 2 TaskManagers, each in its own VM. I configured the state backend to filesystem and pointed it to a local location on each of the above hosts (state.backend.fs.checkpointdir: file:///home/ubuntu/Prototype/flink/flink-checkpoints). I have set parallelism to 1 and each TaskManager has 1 slot.
I then run an event processing job on the JobManager which assigns it to a TaskManager.
I kill the TaskManager running the job, and after a few unsuccessful attempts on the failed TaskManager, Flink tries to run the job on the remaining TaskManager. At this point it fails again because it cannot find the corresponding checkpoints / state: java.io.FileNotFoundException: /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091/chk-14/46f6e71d-ebfe-4b49-bf35-23c2e7f97923 (No such file or directory)
The folder /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091 only exists on the TaskManager that I killed and not on the other one.
My question is: am I supposed to set the same location for checkpointing / state on all the task managers if I want the above functionality?
Thanks!
The checkpoint directory you use needs to be shared across all machines that make up your Flink cluster. Typically this would be something like HDFS or S3 but can be any shared filesystem.
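As a small illustration (the URI, interval, and class name are placeholders; this uses the FsStateBackend API that was current around Flink 1.2, while newer releases configure checkpoint storage instead), the state backend should point at a location every node can reach:

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SharedCheckpointDirExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60 * 1000);

        // Point the checkpoint directory at a filesystem that every JobManager and
        // TaskManager can reach (HDFS, S3, NFS, ...), not a node-local path.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/flink-checkpoints"));

        // Placeholder pipeline; replace with the real job graph.
        env.fromElements(1, 2, 3).print();
        env.execute("shared-checkpoint-dir-example");
    }
}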