I have deployed a standalone Flink (1.15.0) cluster with 3 masters, and I am using ZooKeeper (3.5.0) to provide high availability. Here is my flink.yml configuration:
high-availability: zookeeper
high-availability.storageDir: s3://bucket-name/flink
high-availability.zookeeper.quorum: zookeeper-dns:2181
state.checkpoints.dir: s3://bucket-name/flink/checkpoints
high-availability.cluster-id: flinkId
The problem occurs when, for some reason, all 3 JobManagers fail one after another. For example, the first one stops and starts again, then the second one stops and starts again, and when the third one stops, the TaskManagers can no longer connect to any JobManager.
I can see these logs:
2022-09-01 23:22:50,616 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/resource_manager/connection_info'}.
2022-09-01 23:22:50,626 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/dispatcher/connection_info'}.
2022-09-01 23:22:50,698 WARN akka.remote.transport.netty.NettyTransport
2022-09-01 23:22:50,705 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@127.0.0.1:50505] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@127.0.0.1:50505]] Caused by: [java.net.ConnectException: Connection refused: /127.0.0.1:50505]
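The WARN line above reports a refused connection to 127.0.0.1:50505, which is a loopback address. A small sketch for pulling the remote endpoints out of a TaskManager log and flagging loopback ones is below. The function names are made up for illustration, and the interpretation (that a JobManager may have published a loopback address as its RPC address) is an assumption the logs alone do not confirm.

```python
import ipaddress
import re

# Matches the host and port in Akka addresses such as akka.tcp://flink@host:port.
# '#' is accepted as well, because some log dumps replace '@' with '#'.
AKKA_ADDR = re.compile(r"akka\.tcp://[\w-]+[@#]([^\]:/\s]+):(\d+)")


def remote_endpoints(log_text):
    """Return the sorted, de-duplicated (host, port) pairs found in the log."""
    return sorted({(host, int(port)) for host, port in AKKA_ADDR.findall(log_text)})


def is_loopback(host):
    """True if host is a loopback IP literal or the name 'localhost'."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so only the well-known name counts as loopback.
        return host == "localhost"


if __name__ == "__main__":
    import sys

    # Usage: python check_endpoints.py < taskmanager.log
    for host, port in remote_endpoints(sys.stdin.read()):
        note = "loopback, unreachable from other machines" if is_loopback(host) else "ok"
        print(f"{host}:{port} ({note})")
```

If a loopback endpoint shows up here on a TaskManager that runs on a different machine than the JobManagers, the address each JobManager advertises is worth checking.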
I'm trying to create a simple multi-node Flink cluster (1 master, 1 slave). When I start the cluster using "./bin/start-cluster.sh", both the job manager and the task manager start, but the task manager is not able to register at the job manager. After a few minutes of trying, the task manager dies.
Details about the environment:
I'm working with Google Cloud VMs. The OS is Ubuntu x86_64.
I tried Flink versions 1.7.2 and 1.8.0. Both gave the same error.
job manager hostname = ubuntu-test-1 (10.142.0.40)
task manager hostname = ubuntu-test-2 (10.142.15.250)
$ cat conf/flink-conf.yaml:
env.java.home: /opt/sample/include/jdk
jobmanager.rpc.address: 10.142.0.40
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
rest.port: 8081
$ cat conf/masters
10.142.0.40:8081
$ cat conf/slaves
10.142.15.250
Below is the complete log from the task manager:
2019-06-25 05:44:36,335 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,336 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.7.2, Rev:ceba8af, Date:11.02.2019 @ 14:17:09 UTC)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: sample
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.121-b13
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: (not set)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog.file=/var/tmp/flink-1.7.2/log/flink-sample-taskexecutor-0-ubuntu-test-2.log
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/var/tmp/flink-1.7.2/conf/log4j.properties
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/var/tmp/flink-1.7.2/conf/logback.xml
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /var/tmp/flink-1.7.2/conf
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /var/tmp/flink-1.7.2/lib/flink-python_2.11-1.7.2.jar:/var/tmp/flink-1.7.2/lib/log4j-1.2.17.jar:/var/tmp/flink-1.7.2/lib/slf4j-log4j12-1.7.15.jar:/var/tmp/flink-1.7.2/lib/flink-dist_2.11-1.7.2.jar:::
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,339 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-06-25 05:44:36,343 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 100000.
2019-06-25 05:44:36,352 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /opt/sample/include/jdk
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 10.142.0.40
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-06-25 05:44:36,354 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2019-06-25 05:44:36,360 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-06-25 05:44:36,376 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,395 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,559 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2019-06-25 05:44:36,563 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2019-06-25 05:44:36,564 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2019-06-25 05:44:36,567 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.142.0.40:6123.
2019-06-25 05:44:36,571 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'ubuntu-test-2' (10.142.15.250) for communication.
2019-06-25 05:44:36,574 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:36,935 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,004 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,108 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@ubuntu-test-2:33391]
2019-06-25 05:44:37,115 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@ubuntu-test-2:33391
2019-06-25 05:44:37,121 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:37,138 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,144 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,152 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics@ubuntu-test-2:46253]
2019-06-25 05:44:37,153 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Actor system started at akka.tcp://flink-metrics@ubuntu-test-2:46253
2019-06-25 05:44:37,166 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2019-06-25 05:44:37,171 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /tmp/blobStore-4219e8ab-64ab-4eff-8320-8a50b550959d
2019-06-25 05:44:37,174 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-959579c0-4892-4ba8-b7d3-63969e84f554
2019-06-25 05:44:37,175 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager with ResourceID: 3743bd08e81673b79e96d98ebab7a58a
2019-06-25 05:44:37,179 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: ubuntu-test-2/10.142.15.250, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2019-06-25 05:44:37,224 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 96 GB, usable 86 GB (89.58% usable)
2019-06-25 05:44:37,305 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2019-06-25 05:44:37,354 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,355 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,357 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2019-06-25 05:44:37,389 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 30 ms).
2019-06-25 05:44:37,432 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 42 ms). Listening on SocketAddress /10.142.15.250:41521.
2019-06-25 05:44:37,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2019-06-25 05:44:37,436 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6 for spill files.
2019-06-25 05:44:37,496 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2019-06-25 05:44:37,503 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-06-25 05:44:37,520 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@10.142.0.40:6123/user/resourcemanager(00000000000000000000000000000000).
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-504118c3-1bc2-4624-b1c4-7eacce681ba9
2019-06-25 05:44:47,542 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:07,580 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:27,620 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:47,660 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:07,700 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:27,741 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:47,780 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:07,820 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:27,860 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:47,900 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:07,940 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:27,980 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:48,020 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:08,060 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:28,100 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:37,541 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,544 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,550 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,551 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,555 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,561 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
2019-06-25 05:49:37,562 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,570 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-06-25 05:49:37,576 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,577 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,580 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,584 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,596 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-06-25 05:49:37,597 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
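The log shows the TaskExecutor repeatedly timing out while resolving the ResourceManager at 10.142.0.40:6123. Before digging into Flink settings, a plain TCP reachability check run from the task manager host can help separate a network problem (firewall, wrong bind address) from a configuration problem. This is only a rough sketch: the function name is made up, and the host and port are the ones from the configuration above.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Values taken from conf/flink-conf.yaml in this post; adjust for your cluster.
    host, port = "10.142.0.40", 6123
    state = "reachable" if can_connect(host, port) else "NOT reachable"
    print(f"{host}:{port} is {state} from this machine")
```

A successful connection only shows that something is listening on that port. It does not prove that the Akka endpoint accepts the address the task manager uses for it.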
It looks like the problem was that I used IP addresses instead of hostnames. This was already pointed out in another thread on SO. When I read that thread, I assumed the reason was that IP addresses can change over time for the same host. It turns out that using IP addresses does not work even if they don't change.
I wonder why, then, the Flink documentation shows IP addresses:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/cluster_setup.html
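The answer above does not explain why hostnames work where IP addresses fail. One thing that may be worth checking on each node is what a given hostname actually resolves to, since a hostname mapped to a loopback address in /etc/hosts would not be reachable from the other machine. The sketch below uses only the standard library; the function names are made up for illustration, and whether this was the cause in this case is an assumption.

```python
import ipaddress
import socket


def resolve(hostname):
    """Return the sorted set of addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


def resolves_to_loopback(hostname):
    """True if any resolved address is a loopback address."""
    # IPv6 results can carry a '%scope' suffix, which ipaddress does not accept.
    return any(
        ipaddress.ip_address(addr.split("%")[0]).is_loopback
        for addr in resolve(hostname)
    )


if __name__ == "__main__":
    # Hostnames from this post; run this on both machines and compare the output.
    for name in ("ubuntu-test-1", "ubuntu-test-2"):
        try:
            print(name, resolve(name), "loopback:", resolves_to_loopback(name))
        except socket.gaierror as exc:
            print(name, "does not resolve:", exc)
```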
I had the same issue.
Make sure you are using JDK 1.8, as Flink 1.7.2 needs JDK 1.8. That worked for me!
If you are running in Docker, check that the environment variable below is set:
FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager"
In other cases, check the jobmanager.rpc.address setting in your configuration.
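To check this from inside a container, something like the sketch below could be used. It treats FLINK_PROPERTIES as newline-separated "key: value" lines, which is an assumption about the format based on the single-line example above, and the helper names are made up.

```python
import os


def parse_flink_properties(raw):
    """Parse newline-separated 'key: value' lines into a dict."""
    props = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values like 'host:2181' stay intact.
        key, value = line.split(":", 1)
        props[key.strip()] = value.strip()
    return props


def rpc_address(env=None):
    """Return jobmanager.rpc.address from FLINK_PROPERTIES, or None if it is not set."""
    env = os.environ if env is None else env
    props = parse_flink_properties(env.get("FLINK_PROPERTIES", ""))
    return props.get("jobmanager.rpc.address")


if __name__ == "__main__":
    address = rpc_address()
    if address is None:
        print("jobmanager.rpc.address is not set via FLINK_PROPERTIES")
    else:
        print("jobmanager.rpc.address =", address)
```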
I have a Flink 1.2 cluster made up of 3 JobManagers and 2 TaskManagers. I start the ZooKeeper quorum from JobManager1, get confirmation that ZooKeeper starts on the other 2 JobManagers, and then start a Flink job on JobManager1.
The flink-conf.yaml is the same on all 5 VMs, which means jobmanager.rpc.address points to JobManager1 everywhere.
If I turn off the VM running JobManager1, I would expect ZooKeeper to report one of the remaining JobManagers as the leader and the TaskManagers to reconnect to it. Instead, I get a lot of these messages in the TaskManagers' logs:
2017-03-14 14:13:21,827 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.4:43660/user/jobmanager (attempt 11, timeout: 30 seconds)
2017-03-14 14:13:21,836 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:43660] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:43660]] Caused by: [Connection refused: /1.2.3.4:43660]
I modified the original IP to 1.2.3.4 for confidentiality and because it's always the same IP (of JobManager1).
More logs:
2017-03-15 10:28:28,655 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:28:38,534 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-15 10:28:46,606 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:28:52,431 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:02,435 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,489 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
2017-03-15 10:29:10,490 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,512 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-15 10:29:10,525 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-15 10:29:10,542 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,546 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,548 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,551 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
2017-03-15 10:29:10,552 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-15 10:29:10,567 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:29:10,632 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
2017-03-15 10:29:15,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:20,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:25,582 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:30,592 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
Does anyone know why the TaskManagers are not trying to reconnect to one of the remaining JobManagers (like 1.2.3.5 above)?
Thanks!
For everyone facing the same issue: HA requires a storage location on a distributed file system (DFS) that is accessible from all nodes. I had the state backend checkpoint directory and the ZooKeeper storage directory on each VM pointing to a local filesystem location, so when one of the JobManagers went down, the new leader couldn't resume the running jobs because the recovery information was stored in a location it could not access.
Edit: Since this was asked, the file I modified (for Apache Flink 1.2, see https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/config.html) was
conf/flink-conf.yaml
I set
state.backend.fs.checkpointdir
high-availability.zookeeper.storageDir
to AWS S3 paths accessible from both TaskManagers and JobManagers.
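As a rough sketch of what that looks like in conf/flink-conf.yaml, using the Flink 1.2 key names: the bucket name, the paths, and the ZooKeeper host names below are hypothetical placeholders, not values from my setup, and the example assumes S3 access is already configured for Flink on every node.

```yaml
# conf/flink-conf.yaml (Flink 1.2 key names)
# NOTE: bucket name, paths, and ZooKeeper hosts are hypothetical placeholders.

# Enable ZooKeeper-based high availability
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181

# JobManager recovery metadata: must be reachable from every JobManager
high-availability.zookeeper.storageDir: s3://my-flink-bucket/ha/

# Checkpoint data: must be reachable from all JobManagers and TaskManagers
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-flink-bucket/checkpoints/
```

The important part is that neither directory is a local path such as file:///tmp/..., since a newly elected leader on another machine would not be able to read it.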