Flink state backend for TaskManager - apache-flink

I have a Flink v1.2 setup with 1 JobManager and 2 TaskManagers, each in its own VM. I configured the state backend to filesystem and pointed it to a local location on each of the above hosts (state.backend.fs.checkpointdir: file:///home/ubuntu/Prototype/flink/flink-checkpoints). I have set parallelism to 1 and each TaskManager has 1 slot.
I then run an event processing job on the JobManager which assigns it to a TaskManager.
I kill the TaskManager running the job, and after a few unsuccessful attempts on the failed TaskManager, Flink tries to run the job on the remaining TaskManager. At this point it fails again because it cannot find the corresponding checkpoints / state: java.io.FileNotFoundException: /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091/chk-14/46f6e71d-ebfe-4b49-bf35-23c2e7f97923 (No such file or directory)
The folder /home/ubuntu/Prototype/flink/flink-checkpoints/56c409681baeaf205bc1ba6cbe9f8091 only exists on the TaskManager that I killed and not on the other one.
My question is: am I supposed to set the same location for checkpointing / state on all the TaskManagers if I want the above functionality?
Thanks!

The checkpoint directory you use needs to be shared across all machines that make up your Flink cluster. Typically this would be something like HDFS or S3, but it can be any shared filesystem.
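For illustration, a minimal flink-conf.yaml sketch of a shared checkpoint location (the HDFS namenode address and the S3 bucket name are placeholders, not part of the original setup):
state.backend: filesystem
# must resolve to the same shared location from the JobManager and every TaskManager
state.backend.fs.checkpointdir: hdfs://namenode:9000/flink/flink-checkpoints
# or, if an S3 filesystem is configured:
# state.backend.fs.checkpointdir: s3://my-bucket/flink-checkpoints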

Related

Flink try to recover checkpoint from deleted directories

After cleaning old files (files that had not been accessed for more than a month) out of the S3 bucket used to store checkpoints, some processes of the job do not start when restarting or when restoring from the latest checkpoints, because of missing old files.
The job works well and saves checkpoints as expected (save path s3://flink-checkpoints/check/af8b0712ae0c1f20d2226b86e6bddb60/chk-100274):
2022-04-24 03:58:32.892 Triggering checkpoint 100273 # 1653353912890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 03:58:55.317 Completed checkpoint 100273 for job af8b0712ae0c1f20d2226b86e6bddb60 (679053131 bytes in 22090 ms).
2022-04-24 04:03:32.892 Triggering checkpoint 100274 # 1653354212890 for job af8b0712ae0c1f20d2226b86e6bddb60.
2022-04-24 04:03:35.844 Completed checkpoint 100274 for job af8b0712ae0c1f20d2226b86e6bddb60 (9606712 bytes in 2494 ms).
After one taskmanager was switched off and the job restarted:
2022-04-24 04:04:40.936 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:14.150 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RESTARTING to RUNNING.
2022-04-24 04:05:14.198 Restoring job af8b0712ae0c1f20d2226b86e6bddb60 from latest valid checkpoint: Checkpoint 100274 # 1653354212890 for af8b0712ae0c1f20d2226b86e6bddb60.
After some time the job failed because some processes could not restore their state:
2022-04-24 04:05:17.095 Job test-job (af8b0712ae0c1f20d2226b86e6bddb60) switched from state RUNNING to RESTARTING.
2022-04-24 04:05:17.093 Process first events -> Sink: Sink to test-job (5/10) (4f9089b1015540eb6e13afe4c07fa97b) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for KeyedProcessOperator_f1d5710fb330fd579d15b292e305802c_(5/10) from any of the 1 provided restore options.
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
Caused by: org.apache.flink.util.FlinkRuntimeException: Failed to download data for state handles.
Caused by: com.facebook.presto.hive.s3.PrestoS3FileSystem$UnrecoverableS3OperationException: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default), S3 Extended Request ID: 51f03da-default-default (Path: s3://flink-checkpoints/check/e3d82336005fc40be9af536938716199/shared/64452a30-c8a0-454f-8164-34d9e70142e0)
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: null (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: tx0000000000000f0652d11-00628c2f4a-51f03da-default; S3 Extended Request ID: 51f03da-default-default)
If I completely cancel the job and start a new one with the savepoint path set to the last checkpoint, I get the same errors.
Why does the job, when working with a checkpoint from the af8b0712ae0c1f20d2226b86e6bddb60 folder, try to get files from the e3d82336005fc40be9af536938716199 folder, and what are the rules for clearing old checkpoints from storage?
UPDATE
I found that Flink saves the S3 paths of all TaskManagers' RocksDB files in the chk-*/_metadata file.
This is something that was quite ambiguous for a long time and has recently been addressed in Flink 1.15. I would recommend reading the section 'Clarification of checkpoint and savepoint semantics' in https://flink.apache.org/news/2022/05/05/1.15-announcement.html, including the comparison between checkpoints and savepoints.
The behaviour you've experienced depends on your checkpointing setup (aligned vs unaligned).
By default, cancelling a job removes its old checkpoints. There is a configuration option to control this: execution.checkpointing.externalized-checkpoint-retention. As mentioned by Martijn, normally you would resort to savepoints for controlled job upgrades / restarts.
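For illustration, a flink-conf.yaml sketch of that option (RETAIN_ON_CANCELLATION is an assumed choice; the default DELETE_ON_CANCELLATION removes the checkpoint directory when the job is cancelled):
# keep externalized checkpoints when the job is cancelled, so they can be used for a restart
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION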

How to keep Apache Flink tasks and submission records when restarting the JobManager

I am using Apache Flink 1.10 for batch computation over my stream data. Today I moved my Apache Flink Kubernetes (v1.15.2) pod from machine 1 to machine 2 and found that all submitted task records and the task list disappeared. What is happening? Is the submission record kept only in memory? What should I do to keep my submission records and task list when the Kubernetes pod of Apache Flink restarts? I only found checkpoint persistence, but nothing about tasks.
If I lose the running task history, I must upload my task jar and recreate all tasks; that is a lot of tasks to recreate if the history is lost. Is there any way to resume the tasks automatically?
The configurations that might not be set are:
Job Manager
jobmanager.archive.fs.dir: hdfs:///completed-jobs
History Server
# Monitor the following directories for completed jobs
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Refresh every 10 seconds
historyserver.archive.fs.refresh-interval: 10000
For more details, please look at: https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/historyserver.html#configuration
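A usage sketch, assuming a standalone installation where the keys above have been added to flink-conf.yaml:
# start the separate HistoryServer process; it serves archived jobs on its web UI (port 8082 by default)
./bin/historyserver.sh start
# the JobManager writes completed jobs to jobmanager.archive.fs.dir, and the HistoryServer
# picks them up from historyserver.archive.fs.dir, so they survive JobManager restarts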

Configuring Ports for Flink Job/Task Manager Metrics

I am running Flink in Amazon EMR. In flink-conf.yaml, I have metrics.reporter.prom.port: 9249-9250
Depending on whether the job manager and task manager are running on the same node, the task manager metrics are reported on port 9250 (if running on the same node as the job manager) or on port 9249 (if running on a different node).
Is there a way to configure so that the task manager metrics are always reported on port 9250?
I saw a post saying that we can "provide each *Manager with a separate configuration." How can I do that?
Thanks
You can configure different ports for the JM and TM by starting the processes with differently configured flink-conf.yaml files.
On Yarn, Flink currently uses the same flink-conf.yaml for all processes.
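A sketch of one way to do this in standalone mode (directory names and ports are illustrative; as noted above, this does not work on YARN):
# conf-jm/flink-conf.yaml
metrics.reporter.prom.port: 9249
# conf-tm/flink-conf.yaml
metrics.reporter.prom.port: 9250
# start each process with its own configuration directory
FLINK_CONF_DIR=/path/to/conf-jm ./bin/jobmanager.sh start
FLINK_CONF_DIR=/path/to/conf-tm ./bin/taskmanager.sh start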

Apache Flink number of taskmanagers in local mode

I am working on an Apache Flink (1.5.0) based streaming application.
As part of this I have launched Flink in local mode on my Windows machine.
In order to run my job with the degree of parallelism of 8, I need 8 Task managers providing one task slot each.
I added a task manager with following command:
'/cygdrive/b/Binaries Flink/flink-1.5.0/bin/taskmanager.sh' start
The first few times, a task manager was added successfully with following message:
[INFO] 3 instance(s) of taskexecutor are already running on ... .
Starting taskexecutor daemon on host ... .
After 5 task managers were available, I got the same message:
[INFO] 5 instance(s) of taskexecutor are already running on ... .
Starting taskexecutor daemon on host ... .
The problem is that a sixth task manager is never created.
When I stop one task manager the count goes down to 4; I can then add one additional task manager, but never more than 5 task managers in total.
Is there any limitation to the amount of task managers?
Did anyone experience a similar behaviour?
Thank you very much
There is no limit on how many TaskManagers you can start locally. The only limit is the resources available on your local machine.
If you are using the standalone mode in Flink 1.5.0, then you can also set the number of slots per TaskManager to 7 by adding the following line to the flink-conf.yaml:
taskmanager.numberOfTaskSlots: 7
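A usage sketch under that assumption (the job jar path is a placeholder):
# restart the local cluster so the new slot count takes effect
./bin/stop-cluster.sh && ./bin/start-cluster.sh
# with enough total slots available (e.g. 2 TaskManagers with 7 slots each, or one with 8),
# the job can then run at the desired parallelism
./bin/flink run -p 8 /path/to/your-job.jar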

Remote debugging Flink local cluster

I want to deploy my jobs on a local Flink cluster during development (i.e. JobManager and TaskManager running on my development laptop), and use remote debugging. I tried adding
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" to the flink-conf.yaml file. Since job and task manager are running on the same machine, the task manager throws exception stating that the socket is already in use and terminates. Is there any way I can get this running.
You are probably setting env.java.opts, which affects all JVMs started by Flink. Since the jobmanager gets started first, it grabs the port before the taskmanager is started.
You can use env.java.opts.taskmanager to pass parameters only for taskmanager JVMs.
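For illustration, a flink-conf.yaml sketch along those lines (the port numbers are placeholders; suspend=n avoids blocking startup until a debugger attaches):
# debug only the TaskManager JVM
env.java.opts.taskmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"
# optionally debug the JobManager JVM on a different port
env.java.opts.jobmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5006"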

Resources