Java client fails to connect to local Flink cluster - apache-flink

I am trying out a small program with a local Flink cluster, set up according to the instructions here. The sample wordcount program runs fine, but when I attempt to run my own program, it stalls and then fails while connecting to the JobManager. This is Flink 1.5 with JDK 1.8.
The relevant part of the code is:
FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
options.setStreaming(true);
options.setFlinkMaster("localhost:6123");
options.setRunner(FlinkRunner.class);
I start the cluster with start-cluster.sh, and I can see that the two processes (JobManager and TaskManager) are running. The Flink logs don't show much. On the client side, after turning on debug logging, I can see the following:
18:43:20.507 [flink-akka.actor.default-dispatcher-4] INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@talonx:38183]
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Actor system started at akka.tcp://flink@talonx:38183
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
18:43:20.521 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received SubmitJobAndWait(JobGraph(jobId: dbf63281771465550fd3598b2b67b91f)) but there is no connection to a JobManager yet.
18:43:20.522 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received job test-talonx-0618131319-b721a69a (dbf63281771465550fd3598b2b67b91f).
18:43:20.523 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Disconnect from JobManager null.
After a while, I get the following exception on the client
19:03:19.396 [main] ERROR org.apache.beam.runners.flink.FlinkRunner - Pipeline execution failed
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:449)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:212)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:176)
at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:126)
at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:115)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at Test.main(Test.java:106)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
... 10 common frames omitted
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What might be missing here?
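One thing that can be ruled out quickly is whether anything is listening at the address passed to setFlinkMaster. The following is a minimal sketch, assuming a plain TCP reachability check is a useful first step; PortProbe and isReachable are hypothetical names and not part of Flink or Beam. A successful connection only shows that the port is open, not that the client and the cluster speak the same protocol.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper (not part of Flink or Beam): checks whether a TCP port accepts connections.
final class PortProbe {

    // Returns true if a TCP connection to host:port succeeds within timeoutMillis.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, and unresolved host all count as unreachable.
            return false;
        }
    }

    public static void main(String[] args) {
        // localhost:6123 is the address used with setFlinkMaster in the question.
        System.out.println("localhost:6123 reachable: " + isReachable("localhost", 6123, 2000));
    }
}
```

If the probe reports the port as reachable and the job submission still times out, the cause is more likely a mismatch between client and cluster than a network problem.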

Related

Flink TaskManager timeout?

I am running a Flink application on YARN, and at seemingly random times a TaskManager times out. Here is the error:
java.util.concurrent.TimeoutException: Heartbeat of TaskManager with id someId timed out.
at org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1610)
at org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:339)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:154)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
This happens 30-45 minutes after the job has started running. Any ideas as to what could be causing this?
The cause may be that the JobManager's memory is too small; try increasing it.
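To judge whether a JVM is short on heap, its own memory figures can be read through the standard java.lang.management API. The following is a rough sketch, assuming heap pressure is the thing to check; HeapCheck and its methods are hypothetical names, and the code reports on the JVM it runs in, so it would have to run inside the process of interest (or be replaced by a remote JMX query) to say anything about a JobManager.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports how much of the maximum heap the current JVM is using.
final class HeapCheck {

    // Fraction of the maximum heap in use, or -1.0 if the maximum is undefined.
    static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            return -1.0;
        }
        return (double) usedBytes / (double) maxBytes;
    }

    // Reads the heap usage of the JVM this code runs in.
    static double currentHeapUsedFraction() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        // getMax() returns -1 when the JVM does not define a maximum.
        return usedFraction(heap.getUsed(), heap.getMax());
    }

    public static void main(String[] args) {
        System.out.println("Heap used fraction: " + currentHeapUsedFraction());
    }
}
```

A value that stays close to 1.0 would be consistent with the suggestion above that the process needs more memory.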

Flink - Failure recovering from a savepoint (checkpoint), caused by java.lang.IllegalStateException: There is no operator for the state

The problem:
The Flink JobManager could not recover from a checkpoint.
Caused by: java.lang.IllegalStateException: There is no operator for the state
Background:
I'm running Flink 1.6.3 on Kubernetes, using incremental checkpoints with RocksDB.
I tried to pass the parameter --allowNonRestoredState in order to skip savepoint state that cannot be restored.
From my log:
2019-02-06 08:51:08.068 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --allowNonRestoredState
2019-02-06 08:51:22.827 [flink-akka.actor.default-dispatcher-14] INFO o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
2019-02-06 08:51:22.883 [flink-akka.actor.default-dispatcher-14] INFO o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 1 checkpoints in ZooKeeper.
2019-02-06 08:51:22.883 [flink-akka.actor.default-dispatcher-14] INFO o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 1 checkpoints from storage.
2019-02-06 08:51:22.884 [flink-akka.actor.default-dispatcher-14] INFO o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 1612.
2019-02-06 08:51:22.977 [flink-akka.actor.default-dispatcher-14] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Restoring job 00000000000000000000000000000000 from latest valid checkpoint: Checkpoint 1612 @ 1549376250641 for 00000000000000000000000000000000.
2019-02-06 08:51:22.982 [flink-akka.actor.default-dispatcher-14] ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:36)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1590)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:1058)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$createJobManagerRunner$5(Dispatcher.java:308)
at org.apache.flink.util.function.CheckedSupplier.lambda$unchecked$0(CheckedSupplier.java:34)
... 7 common frames omitted
Caused by: java.lang.IllegalStateException: There is no operator for the state b22e6e8baea7d7e562d5a233f3301ce1
at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.checkStateMappingCompleteness(StateAssignmentOperation.java:569)
at org.apache.flink.runtime.checkpoint.StateAssignmentOperation.assignStates(StateAssignmentOperation.java:77)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreLatestCheckpointedState(CheckpointCoordinator.java:1049)
at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1138)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:294)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
... 10 common frames omitted
2019-02-06 08:51:23.013 [TransientBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-02-06 08:51:23.033 [BlobServer shutdown hook] INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6124
Expected result:
The job will start running from the latest checkpoint and will skip the state that cannot be restored
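The exception message suggests that the checkpoint holds state for an operator ID that the job being started does not contain. The following is a toy model of such a completeness check, assuming it simply compares the operator IDs found in the checkpoint with those in the job. It is a simplified illustration and not Flink's actual code; StateMappingCheck, unmatchedStates and the ID "operator-a" are hypothetical, and only the ID b22e6e8baea7d7e562d5a233f3301ce1 comes from the log above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model (not Flink's implementation) of checking that every piece of
// checkpointed state has a matching operator in the job.
final class StateMappingCheck {

    // Returns the state IDs that have no operator in the job. If
    // allowNonRestoredState is false, the first unmatched ID raises an exception.
    static List<String> unmatchedStates(Set<String> stateOperatorIds,
                                        Set<String> jobOperatorIds,
                                        boolean allowNonRestoredState) {
        List<String> skipped = new ArrayList<>();
        for (String id : stateOperatorIds) {
            if (!jobOperatorIds.contains(id)) {
                if (!allowNonRestoredState) {
                    throw new IllegalStateException("There is no operator for the state " + id);
                }
                skipped.add(id);
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        // "operator-a" is a made-up ID; the second ID is the one reported in the log.
        Set<String> state = new HashSet<>(
                Arrays.asList("operator-a", "b22e6e8baea7d7e562d5a233f3301ce1"));
        Set<String> job = Collections.singleton("operator-a");

        System.out.println("Skipped with flag set: " + unmatchedStates(state, job, true));
        try {
            unmatchedStates(state, job, false);
        } catch (IllegalStateException e) {
            System.out.println("Without flag: " + e.getMessage());
        }
    }
}
```

In this model the flag only helps if it actually reaches the check. Whether that is the case when a checkpoint is recovered from ZooKeeper, as opposed to a savepoint given at submission, is not something the log above settles.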

Apache flink 1.6 HA standalone cluster: Fatal error in the cluster entrypoint

I am trying to set up an Apache Flink standalone cluster consisting of two master nodes and one worker node, using Flink 1.6 and ZooKeeper. To start and stop the cluster I followed the process described in the Flink 1.6 documentation: to start the cluster I ran start-zookeeper-quorum.sh and then start-cluster.sh,
and to stop the cluster I ran stop-cluster.sh.
After running one job (which failed), then stopping and restarting the cluster, I noticed an error: neither of the two JobManagers could start because they were looking for the directory job_e44fdee88a931200953fed45883ee3f1, which does not exist (I assume this is the directory for my failed job, but I am not sure).
How do I recover the cluster from this error?
2018-09-06 14:58:04,065 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
:
... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: /hastorage/default/blob/job_e44fdee88a931200953fed45883ee3f1/blob_p-f655414c973995e93709acbd22c1c162c9c43a98-75bd4e71882f988a6c337222efadba7b (No such file or directory)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134)
... 25 more
Caused by: java.io.FileNotFoundException: /hastorage/default/blob/job_e44fdee88a931200953fed45883ee3f1/blob_p-f655414c973995e93709acbd22c1c162c9c43a98-75bd4e71882f988a6c337222efadba7b (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:102)
at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:84)
at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:493)
at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:417)
at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerJob(BlobLibraryCacheManager.java:91)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:131)
... 25 more
2018-09-06 14:58:04,069 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
The problem you are observing is caused by a bug in Flink. You can find more details about the problem here. The problem will be fixed with the next bug fix release.

Flink HA Cluster JobManager issues

I have a Flink 1.2 cluster made up of 3 JobManagers and 2 TaskManagers. I start the ZooKeeper quorum from JobManager1, get confirmation that ZooKeeper has started on the other 2 JobManagers, and then start a Flink job on JobManager1.
The flink-conf.yaml is the same on all 5 VMs, which means jobmanager.rpc.address points to JobManager1 everywhere.
If I turn off the VM running JobManager1, I would expect ZooKeeper to elect one of the remaining JobManagers as leader and the TaskManagers to reconnect to it. Instead, I get a lot of these messages in the TaskManagers' logs:
2017-03-14 14:13:21,827 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.4:43660/user/jobmanager (attempt 11, timeout: 30 seconds)
2017-03-14 14:13:21,836 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:43660] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:43660]] Caused by: [Connection refused: /1.2.3.4:43660]
I modified the original IP to 1.2.3.4 for confidentiality and because it's always the same IP (of JobManager1).
More logs:
2017-03-15 10:28:28,655 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:28:38,534 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-03-15 10:28:46,606 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:28:52,431 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:02,435 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,489 INFO org.apache.flink.runtime.taskmanager.TaskManager - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
2017-03-15 10:29:10,490 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data.
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,491 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,512 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,515 INFO org.apache.flink.runtime.taskmanager.Task - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,516 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager
2017-03-15 10:29:10,525 INFO org.apache.flink.runtime.blob.BlobCache - Shutting down BlobCache
2017-03-15 10:29:10,542 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,546 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,548 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,551 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Flat Map (1/1)
2017-03-15 10:29:10,552 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-15 10:29:10,567 INFO org.apache.flink.core.fs.FileSystem - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:29:10,632 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
2017-03-15 10:29:10,633 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
2017-03-15 10:29:15,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:20,571 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:25,582 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:30,592 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
Does anyone know why the TaskManagers are not trying to reconnect to one of the remaining JobManagers (like 1.2.3.5 above)?
Thanks!
For everyone facing the same issue: HA requires a DFS location accessible from all nodes. I had the state backend checkpoint directory and the ZooKeeper storage directory pointing to a local filesystem location on each VM, so when one of the JobManagers went down, the new leader could not resume the running jobs because the information it needed was stored in a location it could not access.
Edit: Since this was asked, the file I modified (for Apache Flink 1.2, see https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/config.html) was
conf/flink-conf.yaml
I set
state.backend.fs.checkpointdir
high-availability.zookeeper.storageDir
to AWS S3 paths accessible from both TaskManagers and JobManagers.
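A quick sanity check for this kind of misconfiguration is to look at the URI scheme of each configured directory. The following is a minimal sketch, assuming that a path with no scheme or with the file scheme is node-local; SharedStorageCheck, isLocalPath and all path values are hypothetical placeholders. A scheme-less path on a shared mount such as NFS would be flagged even though it is reachable from every node, so treat the result as a hint only.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: flags storage directories that look node-local.
final class SharedStorageCheck {

    // A path counts as local if it has no URI scheme or uses the "file" scheme.
    static boolean isLocalPath(String path) {
        String scheme = URI.create(path).getScheme();
        return scheme == null || scheme.equalsIgnoreCase("file");
    }

    public static void main(String[] args) {
        // The keys are the two settings named above; the values are placeholders.
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put("state.backend.fs.checkpointdir", "file:///tmp/flink/checkpoints");
        settings.put("high-availability.zookeeper.storageDir", "s3://example-bucket/flink/ha");

        for (Map.Entry<String, String> entry : settings.entrySet()) {
            String verdict = isLocalPath(entry.getValue()) ? "looks node-local" : "looks shared";
            System.out.println(entry.getKey() + " = " + entry.getValue() + " (" + verdict + ")");
        }
    }
}
```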

Which ports should I open in the firewall on nodes with Apache Flink?

When I try to run my flow on Apache Flink standalone cluster I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d @ app-2 - 4 slots - URL: akka.tcp://flink@192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
It seems that port 46369 is blocked by the firewall. That is plausible, because I read the configuration section and opened only these ports:
6121:
comment: Apache Flink TaskManager (Data Exchange)
6122:
comment: Apache Flink TaskManager (IPC)
6123:
comment: Apache Flink JobManager
6130:
comment: Apache Flink JobManager (BLOB Server)
8081:
comment: Apache Flink JobManager (Web UI)
The same ports are set in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
Is this exception related to blocked ports?
Which ports should I open in the firewall for a standalone Apache Flink cluster?
UPDATE 1
I found a configuration problem in the masters and slaves files (I had left out the newline separators between the hosts listed in these files). I fixed it, and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs, the TaskManager on app-1.stag.local cannot connect to the other TaskManager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has the port open:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So I think the problem is related to the firewall, but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I found the problem: the taskmanager.data.port parameter is set to 0 by default (although the documentation says it should be set to 6121).
So I set this port in flink-conf.yaml, and now everything works.
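This fix is consistent with how port 0 behaves at the socket level: binding to port 0 asks the operating system to pick any free port, so the listening port can differ on every start and cannot be matched by a fixed firewall rule. The following is a minimal sketch in plain Java, independent of Flink; EphemeralPortDemo and boundPort are hypothetical names.

```java
import java.io.IOException;
import java.net.ServerSocket;

// Hypothetical demo: shows which port is actually bound when port 0 is requested.
final class EphemeralPortDemo {

    // Binds a server socket to requestedPort and returns the port actually bound.
    // Requesting 0 lets the operating system choose a free port.
    static int boundPort(int requestedPort) throws IOException {
        try (ServerSocket server = new ServerSocket(requestedPort)) {
            return server.getLocalPort();
        }
    }

    public static void main(String[] args) throws IOException {
        // Each call may print a different port, which a static firewall rule cannot anticipate.
        System.out.println("Requested 0, bound to " + boundPort(0));
        System.out.println("Requested 0, bound to " + boundPort(0));
    }
}
```

Pinning the port to a fixed value, as done above with taskmanager.data.port, is what makes it possible to open it in the firewall.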
