Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host in taskmanager - apache-flink

When I start my apache flink 1.10 taskmanager service in kubernetes(v1.15.2) cluster,it shows logs like this:
2020-05-01 08:34:55,847 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/resourcemanager..
2020-05-01 08:34:55,847 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2020-05-01 08:34:55,848 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#flink-jobmanager:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2020-05-01 08:35:08,874 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
2020-05-01 08:35:08,877 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#flink-jobmanager:6123] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#flink-jobmanager:6123]] Caused by: [java.net.NoRouteToHostException: No route to host]
2020-05-01 08:35:08,878 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#flink-jobmanager:6123/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink#flink-jobmanager:6123/user/resourcemanager..
2020-05-01 08:35:21,907 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.NoRouteToHostException: No route to host
and the taskmanager could not registered success, and I logged into taskmanager and find out I could success ping jobmanager liket this:
flink#flink-taskmanager-54d85f57c7-nl9cf:~$ ping flink-jobmanager
PING flink-jobmanager.dabai-fat.svc.cluster.local (10.254.58.171) 56(84) bytes of data.
64 bytes from flink-jobmanager.dabai-fat.svc.cluster.local (10.254.58.171): icmp_seq=1 ttl=64 time=0.045 ms
64 bytes from flink-jobmanager.dabai-fat.svc.cluster.local (10.254.58.171): icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from flink-jobmanager.dabai-fat.svc.cluster.local (10.254.58.171): icmp_seq=3 ttl=64 time=0.079 ms
so why this would happen and what should I do to fix it?

Try to install nmap in your kubernetes taskmanger's pod container:
apt-get udpate
apt-get install nmap -y
then scan the jobmanager and make sure the pod's expose port 6123 is accessable(in my case ,I found could not access the port 6123 from current pod).
nmap -T4 <your-jobmanager's-pod-ip>
Hope this help.

Related

Apache Flink and the Zookeeper high availability doesn't work as expected

I have deployed an standalone flink( 1.15.0) cluster with 3 masters and i am using Zookeeper(3.5.0) to provide high availability. Here i share my flink.yml configuration:
high-availability: zookeeper
high-availability.storageDir: s3://bucket-name/flink
high-availability.zookeeper.quorum: zookeeper-dns:2181
state.checkpoints.dir: s3://bucket-name/flink/checkpoints
high-availability.cluster-id: flinkId
The problem is when for some reason all 3 jobmanagers fail, for example the first 1 stops and then starts again, then the second one stops and starts again and when the third one stops, the taskmanagers can't connect anymore to job managers.
I can see this logs:
2022-09-01 23:22:50,616 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/resource_manager/connection_info'}.
2022-09-01 23:22:50,626 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with ZookeeperLeaderRetrievalDriver{connectionInformationPath='/dispatcher/connection_info'}.
2022-09-01 23:22:50,698 WARN akka.remote.transport.netty.NettyTransport 2022-09-01 23:22:50,705 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink#127.0.0.1:50505] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#127.0.0.1:50505]] Caused by: [java.net.ConnectException: Connection refused: /127.0.0.1:50505]
2022-09-01 23:22:50,698 WARN akka.remote.transport.netty.NettyTransport 2022-09-01 23:22:50,705 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink#127.0.0.1:50505] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink#127.0.0.1:50505]] Caused by: [java.net.ConnectException: Connection refused: /127.0.0.1:50505]

Flink 1.10.0 - The heartbeat of ResourceManager with id xxxx timed out

I am running flink standalone cluster HA in kubernetes. The same setup runs perfectly when using Flink 1.9 but getting below error continuously when using Flink 1.10.
INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of ResourceManager with id 783439e4ead380c60498e32a8e1c0ce3 timed out.
DEBUG org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 783439e4ead380c60498e32a8e1c0ce3.
org.apache.flink.runtime.taskexecutor.exceptions.TaskManagerException: The heartbeat of ResourceManager with id 783439e4ead380c60498e32a8e1c0ce3 timed out.
at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerHeartbeatListener.notifyHeartbeatTimeout(TaskExecutor.java:1842)
at org.apache.flink.runtime.heartbeat.HeartbeatMonitorImpl.run(HeartbeatMonitorImpl.java:109)
flink-conf.yaml :
jobmanager.rpc.address: xx.xxx.xx.xxx
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1500m
taskmanager.memory.process.size: 4000m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
jobmanager.execution.failover-strategy: region
state.backend: filesystem
state.checkpoints.dir: file:///checkpoints
state.savepoints.dir: file:///savepoints
high-availability: zookeeper
high-availability.jobmanager.port: 50010
high-availability.zookeeper.quorum: xx.xx.xx.xx:xxxx
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /ABCD
high-availability.storageDir: file:///recovery
heartbeat.interval: 60000
heartbeat.timeout: 60000
taskmanager.debug.memory.log: true
taskmanager.debug.memory.log-interval: 10000
taskmanager.memory.managed.fraction: 0.1
blob.server.port: 6124
query.server.port: 6125

flink task manager could not register at job manager

I'm trying to create simple multi node flink cluster (1 master 1 slave). When I start my cluster using "./bin/start-cluster.sh", both job manager and task manager are started, but the task manager is not able to register at the job manager. After few minutes of trying, the task manager dies.
Details about the environment:
I'm working with Google cloud VMs. OS is Ubuntu x86_64
tried with flink versions flink-1.7.2 and flink-1.8.0. Both gave the same error.
job manager hostname = ubuntu-test-1 (10.142.0.40)task manager hostname = ubuntu-test-2 (10.142.15.250)
$ cat conf/flink-conf.yaml:
env.java.home: /opt/sample/include/jdk
jobmanager.rpc.address: 10.142.0.40
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
rest.port: 8081
$cat conf/masters
10.142.0.40:8081
$ cat conf/slaves
10.142.15.250
Below is the complete log from task manager:
2019-06-25 05:44:36,335 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,336 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.7.2, Rev:ceba8af, Date:11.02.2019 # 14:17:09 UTC)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: sample
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.121-b13
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: (not set)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog.file=/var/tmp/flink-1.7.2/log/flink-sample-taskexecutor-0-ubuntu-test-2.log
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/var/tmp/flink-1.7.2/conf/log4j.properties
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/var/tmp/flink-1.7.2/conf/logback.xml
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /var/tmp/flink-1.7.2/conf
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /var/tmp/flink-1.7.2/lib/flink-python_2.11-1.7.2.jar:/var/tmp/flink-1.7.2/lib/log4j-1.2.17.jar:/var/tmp/flink-1.7.2/lib/slf4j-log4j12-1.7.15.jar:/var/tmp/flink-1.7.2/lib/flink-dist_2.11-1.7.2.jar:::
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,339 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-06-25 05:44:36,343 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 100000.
2019-06-25 05:44:36,352 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /opt/sample/include/jdk
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 10.142.0.40
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-06-25 05:44:36,354 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2019-06-25 05:44:36,360 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-06-25 05:44:36,376 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,395 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,559 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2019-06-25 05:44:36,563 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2019-06-25 05:44:36,564 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2019-06-25 05:44:36,567 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.142.0.40:6123.
2019-06-25 05:44:36,571 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'ubuntu-test-2' (10.142.15.250) for communication.
2019-06-25 05:44:36,574 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:36,935 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,004 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,108 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink#ubuntu-test-2:33391]
2019-06-25 05:44:37,115 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink#ubuntu-test-2:33391
2019-06-25 05:44:37,121 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:37,138 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,144 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,152 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics#ubuntu-test-2:46253]
2019-06-25 05:44:37,153 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Actor system started at akka.tcp://flink-metrics#ubuntu-test-2:46253
2019-06-25 05:44:37,166 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2019-06-25 05:44:37,171 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /tmp/blobStore-4219e8ab-64ab-4eff-8320-8a50b550959d
2019-06-25 05:44:37,174 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-959579c0-4892-4ba8-b7d3-63969e84f554
2019-06-25 05:44:37,175 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager with ResourceID: 3743bd08e81673b79e96d98ebab7a58a
2019-06-25 05:44:37,179 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: ubuntu-test-2/10.142.15.250, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2019-06-25 05:44:37,224 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 96 GB, usable 86 GB (89.58% usable)
2019-06-25 05:44:37,305 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2019-06-25 05:44:37,354 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,355 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,357 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2019-06-25 05:44:37,389 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 30 ms).
2019-06-25 05:44:37,432 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 42 ms). Listening on SocketAddress /10.142.15.250:41521.
2019-06-25 05:44:37,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2019-06-25 05:44:37,436 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6 for spill files.
2019-06-25 05:44:37,496 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2019-06-25 05:44:37,503 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-06-25 05:44:37,520 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink#10.142.0.40:6123/user/resourcemanager(00000000000000000000000000000000).
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-504118c3-1bc2-4624-b1c4-7eacce681ba9
2019-06-25 05:44:47,542 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:07,580 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:27,620 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:47,660 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:07,700 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:27,741 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:47,780 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:07,820 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:27,860 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:47,900 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:07,940 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:27,980 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:48,020 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:08,060 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:28,100 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink#10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink#10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:37,541 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://flink#ubuntu-test-2:33391/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,544 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,550 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://flink#ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,551 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,555 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,561 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
2019-06-25 05:49:37,562 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://flink#ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,570 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-06-25 05:49:37,576 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,577 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,580 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,584 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,596 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-06-25 05:49:37,597 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down. 41,1 Top
Looks like the problem was that I used IP addresses instead of hostnames. This was already pointed out in some other thread on SO. When I read that thread, I thought the reason was because IP addresses can change over time for the same host. Looks like, using IP addresses does not work, even if they don't change.
Wondering why then, in flink documentation, they showed IP addresses.
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/cluster_setup.html
I had same issue,
make sure you are using jdk-1.8 as flink 1.7.2 need jdk-1.8, worked for me!
check if below environment variable set, while docker setup.
FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager"
or check jobmanager.rpc.address configuration in other cases.

Flink HA Cluster JobManager issues

I have a setup with flink 1.2 cluster, made up of 3 JobManagers and 2 TaskManagers. I start the Zookeeper Quorum from JobManager1, I get confirmation Zookeeper starts on the other 2 JobManagers then I start a Flink job on this JobManager1.
The flink-conf.yaml is the same on all 5 VMs this means jobmanager.rpc.address: points to JobManager1 everywhere.
If I turn off the VM running JobManager1 I would expect Zookeeper to say one of the remaining JobManagers is the leader and the TaskManagers should reconnect to it. Instead I get in the TaskManagers' logs a lot of these messages
2017-03-14 14:13:21,827 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#1.2.3.4:43660/user/jobmanager (attempt 11, timeout: 30 seconds)
2017-03-14 14:13:21,836 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:43660] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:43660]] Caused by: [Connection refused: /1.2.3.4:43660]
I modified the original IP to 1.2.3.4 for confidentiality and because it's always the same IP (of JobManager1).
More logs:
2017-03-15 10:28:28,655 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:28:38,534 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-15 10:28:46,606 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:28:52,431 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:02,435 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,489 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
2017-03-15 10:29:10,490 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,512 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink#1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-15 10:29:10,525 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-15 10:29:10,542 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,546 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,548 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,551 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
2017-03-15 10:29:10,552 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink#1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-15 10:29:10,567 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:29:10,632 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink#1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
2017-03-15 10:29:15,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:20,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:25,582 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:30,592 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink#1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
Does anyone know why the TaskManagers are not trying to reconnect to one of the remaining JobManagers (like 1.2.3.5 above)?
Thanks!
For everyone facing the same issue, HA requires you to provide a DFS location accessible from all nodes. I had backend state checkpoint directory and zookeeper storage directory pointing on each VM to a local filesystem location and when one of the JobManagers went down the new leader couldn't resume the running jobs because of lack of information / location not accessible.
Edit: Since this was asked, the file I modified (In the case of Apache Flink 1.2 (https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/config.html)) was
conf/flink-conf.yaml
I set
state.backend.fs.checkpointdir
high-availability.zookeeper.storageDir
to AWS S3 paths .accessible from both TaskManagers and JobManagers.

Which ports should I open in firewall on nodes with Apach Flink?

When I try to run my flow on Apache Flink standalone cluster I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d # app-2 - 4 slots - URL: akka.tcp://flink#192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
Seems like port 46369 blocked by firewall. It is true because I read configuration section and open these ports only:
6121:
comment: Apache Flink TaskManager (Data Exchange)
6122:
comment: Apache Flink TaskManager (IPC)
6123:
comment: Apache Flink JobManager
6130:
comment: Apache Flink JobManager (BLOB Server)
8081:
comment: Apache Flink JobManager (Web UI)
The same ports described in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
This exception related to blocked ports. Right?
Which ports should I open on firewall for standalone Apache Flink cluster?
UPDATE 1
I found configuration problem in masters and slaves files (I skip new line separators between hosts described in these files). I fixed it and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs the app-1.stag.local task manager can't connect to other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has open port:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So, I think problem related to firewall but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I have found a problem: taskmanager.data.port parameter was set to 0 by default (but documentation say what it should be set to 6121).
So, I set this port in flink-conf.yaml and now all works fine.

Resources