Flink CLI throws exception on EMR on a yarn cluster - apache-flink

After moving my environment from a standalone cluster to a YARN cluster on EMR, I have been running into issues with the Flink CLI commands when a job has been running for a long time. When I run flink list on the CLI, an exception is thrown:
> bin/flink list
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:438)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:420)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:979)
at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:417)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1047)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:276)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 17 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:325)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
The job itself, as well as YARN, seems to be fine; there is no issue there. I am unsure how long it takes for this to happen: I have had some jobs run for about a week with no problems, but usually after 2+ weeks the exception occurs at some point. I am currently running version 1.6.0.
I am not sure which logs would be useful in this case, but I would be happy to provide anything I can in order to solve this problem.
Thank you
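The innermost causes in the trace name the endpoint the client tried to reach. A minimal sketch, assuming the trace has been saved as plain text, that pulls that endpoint out of the "Connection refused" lines; the helper name `refused_endpoints` is hypothetical and not part of Flink:

```python
import re
from typing import List, Tuple

# Matches the "Connection refused: host/ip:port" form seen in the trace above.
_REFUSED = re.compile(
    r"Connection refused: (?P<host>[^/\s]*)/(?P<ip>[0-9.]+):(?P<port>\d+)"
)

def refused_endpoints(trace: str) -> List[Tuple[str, str, int]]:
    """Return the distinct (host, ip, port) targets that refused a connection."""
    seen: List[Tuple[str, str, int]] = []
    for m in _REFUSED.finditer(trace):
        endpoint = (m.group("host"), m.group("ip"), int(m.group("port")))
        if endpoint not in seen:
            seen.append(endpoint)
    return seen

# Three "Caused by" lines copied from the trace above.
sample = (
    "Caused by: java.util.concurrent.CompletionException: "
    "org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: "
    "Connection refused: localhost/127.0.0.1:8081\n"
    "Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: "
    "Connection refused: localhost/127.0.0.1:8081\n"
    "Caused by: java.net.ConnectException: Connection refused\n"
)
print(refused_endpoints(sample))
```

For this trace the only target is localhost on port 8081, so the client is trying to reach a REST endpoint on the machine where the CLI runs.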
Update with logs:
2018-11-28 17:01:52,368 INFO org.apache.flink.client.cli.CliFrontend - --------------------------------------------------------------------------------
2018-11-28 17:01:52,369 INFO org.apache.flink.client.cli.CliFrontend - Starting Command Line Client (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-11-28 17:01:52,369 INFO org.apache.flink.client.cli.CliFrontend - OS current user: hadoop
2018-11-28 17:01:52,790 INFO org.apache.flink.client.cli.CliFrontend - Current Hadoop/Kerberos user: hadoop
2018-11-28 17:01:52,790 INFO org.apache.flink.client.cli.CliFrontend - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-11-28 17:01:52,790 INFO org.apache.flink.client.cli.CliFrontend - Maximum heap size: 7150 MiBytes
2018-11-28 17:01:52,790 INFO org.apache.flink.client.cli.CliFrontend - JAVA_HOME: /etc/alternatives/jre
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - Hadoop version: 2.8.3
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - JVM Options:
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - -Dlog.file=/home/hadoop/flink-1.6.0/log/flink-hadoop-client.log
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - -Dlog4j.configuration=file:/home/hadoop/flink-1.6.0/conf/log4j-cli.properties
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - -Dlogback.configurationFile=file:/home/hadoop/flink-1.6.0/conf/logback.xml
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - Program Arguments:
2018-11-28 17:01:52,792 INFO org.apache.flink.client.cli.CliFrontend - list
2018-11-28 17:01:52,794 INFO org.apache.flink.client.cli.CliFrontend - --------------------------------------------------------------------------------
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 20480m
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 20480m
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.9
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-11-28 17:01:52,797 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, rocksdb
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, s3://bucket/checkpoint
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, s3://bucket/checkpoint
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.timeout, 60000
2018-11-28 17:01:52,798 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.ask.timeout, 60s
2018-11-28 17:01:53,029 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to hadoop (auth:SIMPLE)
2018-11-28 17:01:53,051 INFO org.apache.flink.client.cli.CliFrontend - Running 'list' command.
2018-11-28 17:01:53,082 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2018-11-28 17:01:53,256 INFO org.apache.flink.runtime.rest.RestClient - Rest client endpoint started.
2018-11-28 17:01:53,418 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...
2018-11-28 17:02:53,492 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2018-11-28 17:02:53,493 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2018-11-28 17:02:53,495 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:438)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:420)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:979)
at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:417)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1047)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Number of retries has been exhausted.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$1(RestClient.java:276)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:327)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:343)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 17 more
Caused by: org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8081
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:325)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
... 7 more
Caused by: java.net.ConnectException: Connection refused
... 11 more
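The client log suggests where localhost:8081 comes from: rest.port is loaded as 8081 and, per the WARN line, the deprecated key jobmanager.rpc.address (here localhost) is used in place of rest.address. A rough sketch of that fallback, written as a standalone illustration rather than Flink's actual code; the function name `rest_endpoint` is an assumption:

```python
from typing import Dict, Tuple

def rest_endpoint(conf: Dict[str, str]) -> Tuple[str, int]:
    """Pick the REST host and port the way the log lines above suggest:
    prefer rest.address, else fall back to the deprecated jobmanager.rpc.address."""
    host = conf.get("rest.address") or conf.get("jobmanager.rpc.address")
    if host is None:
        raise KeyError("neither rest.address nor jobmanager.rpc.address is set")
    # Falling back to 8081 when rest.port is absent is an assumption of this sketch.
    port = int(conf.get("rest.port", "8081"))
    return host, port

# Values copied from the "Loading configuration property" lines in the client log.
client_conf = {
    "jobmanager.rpc.address": "localhost",
    "jobmanager.rpc.port": "6123",
    "rest.port": "8081",
}
print(rest_endpoint(client_conf))
```

With the values from this log the result is localhost and 8081, which matches the address in the "Connection refused" message.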
Entrypoint + config logs
Container: container_1541525872902_0001_01_000001 on compute.internal_8041
=======================================================================================================
LogType:jobmanager.log
Log Upload Time:Thu Nov 29 20:13:10 +0000 2018
LogLength:12837590
Log Contents:
2018-11-06 18:26:25,585 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-06 18:26:25,586 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-11-06 18:26:25,586 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: yarn
2018-11-06 18:26:26,007 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: hadoop
2018-11-06 18:26:26,007 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-11-06 18:26:26,007 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 13653 MiBytes
2018-11-06 18:26:26,007 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-openjdk
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.8.3
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx15360m
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/var/log/hadoop-yarn/containers/application_1541525872902_0001/container_1541525872902_0001_01_000001/jobmanager.log
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:logback.xml
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties
2018-11-06 18:26:26,008 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)
2018-11-06 18:26:26,010 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-06 18:26:26,011 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-11-06 18:26:26,013 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
2018-11-06 18:26:26,015 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, s3://bucket/checkpoint
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.timeout, 60000
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.cluster-id, application_1541525872902_0001
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: internal.cluster.execution-mode, NORMAL
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.9
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-11-06 18:26:26,016 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-11-06 18:26:26,017 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, rocksdb
2018-11-06 18:26:26,017 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.ask.timeout, 60s
2018-11-06 18:26:26,017 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 20480m
2018-11-06 18:26:26,017 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 20480m
2018-11-06 18:26:26,017 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, s3://bucket/checkpoint
2018-11-06 18:26:26,031 INFO org.apache.flink.runtime.clusterframework.BootstrapTools - Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1541525872902_0001
2018-11-06 18:26:26,046 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint.
2018-11-06 18:26:26,046 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2018-11-06 18:26:26,108 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to hadoop (auth:SIMPLE)
2018-11-06 18:26:26,125 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2018-11-06 18:26:26,131 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at compute.internal:40607
2018-11-06 18:26:26,612 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-11-06 18:26:26,706 INFO akka.remote.Remoting - Starting remoting
2018-11-06 18:26:26,804 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@compute.internal:40607]
2018-11-06 18:26:26,813 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@compute.internal:40607
2018-11-06 18:26:26,838 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /mnt/yarn/usercache/hadoop/appcache/application_1541525872902_0001/blobStore-b4eb7331-9ac8-4fc9-ab1f-64f6a9c8173f
2018-11-06 18:26:26,842 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:32901 - max concurrent requests: 50 - max backlog: 1000
2018-11-06 18:26:26,857 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-11-06 18:26:26,860 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /mnt/yarn/usercache/hadoop/appcache/application_1541525872902_0001/executionGraphStore-0b405259-cc50-4332-93dc-847b92071699, expiration time 3600000, maximum cache size 52428800 bytes.
2018-11-29 20:13:10,717 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.

Related

flink task manager could not register at job manager

I'm trying to create a simple multi-node Flink cluster (1 master, 1 slave). When I start my cluster using "./bin/start-cluster.sh", both the job manager and the task manager are started, but the task manager is not able to register at the job manager. After a few minutes of trying, the task manager dies.
Details about the environment:
I'm working with Google Cloud VMs. The OS is Ubuntu x86_64.
I tried Flink versions 1.7.2 and 1.8.0. Both gave the same error.
job manager hostname = ubuntu-test-1 (10.142.0.40)
task manager hostname = ubuntu-test-2 (10.142.15.250)
$ cat conf/flink-conf.yaml:
env.java.home: /opt/sample/include/jdk
jobmanager.rpc.address: 10.142.0.40
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.heap.size: 1024m
taskmanager.numberOfTaskSlots: 1
parallelism.default: 1
rest.port: 8081
$ cat conf/masters
10.142.0.40:8081
$ cat conf/slaves
10.142.15.250
Below is the complete log from task manager:
2019-06-25 05:44:36,335 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,336 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.7.2, Rev:ceba8af, Date:11.02.2019 @ 14:17:09 UTC)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: sample
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: <no hadoop dependency found>
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: Java HotSpot(TM) 64-Bit Server VM - Oracle Corporation - 1.8/25.121-b13
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 922 MiBytes
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: (not set)
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2019-06-25 05:44:36,337 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx922M
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=8388607T
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog.file=/var/tmp/flink-1.7.2/log/flink-sample-taskexecutor-0-ubuntu-test-2.log
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/var/tmp/flink-1.7.2/conf/log4j.properties
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/var/tmp/flink-1.7.2/conf/logback.xml
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /var/tmp/flink-1.7.2/conf
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /var/tmp/flink-1.7.2/lib/flink-python_2.11-1.7.2.jar:/var/tmp/flink-1.7.2/lib/log4j-1.2.17.jar:/var/tmp/flink-1.7.2/lib/slf4j-log4j12-1.7.15.jar:/var/tmp/flink-1.7.2/lib/flink-dist_2.11-1.7.2.jar:::
2019-06-25 05:44:36,338 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2019-06-25 05:44:36,339 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
2019-06-25 05:44:36,343 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 100000.
2019-06-25 05:44:36,352 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, /opt/sample/include/jdk
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 10.142.0.40
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 1024m
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2019-06-25 05:44:36,353 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2019-06-25 05:44:36,354 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2019-06-25 05:44:36,360 INFO org.apache.flink.core.fs.FileSystem - Hadoop is not in the classpath/dependencies. The extended set of supported File Systems via Hadoop is not available.
2019-06-25 05:44:36,376 INFO org.apache.flink.runtime.security.modules.HadoopModuleFactory - Cannot create Hadoop Security Module because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,395 INFO org.apache.flink.runtime.security.SecurityUtils - Cannot install HadoopSecurityContext because Hadoop cannot be found in the Classpath.
2019-06-25 05:44:36,559 WARN org.apache.flink.configuration.Configuration - Config uses deprecated configuration key 'jobmanager.rpc.address' instead of proper key 'rest.address'
2019-06-25 05:44:36,563 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - Trying to select the network interface and address to use by connecting to the leading JobManager.
2019-06-25 05:44:36,564 INFO org.apache.flink.runtime.util.LeaderRetrievalUtils - TaskManager will try to connect for 10000 milliseconds before falling back to heuristics
2019-06-25 05:44:36,567 INFO org.apache.flink.runtime.net.ConnectionUtils - Retrieved new target address /10.142.0.40:6123.
2019-06-25 05:44:36,571 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - TaskManager will use hostname/address 'ubuntu-test-2' (10.142.15.250) for communication.
2019-06-25 05:44:36,574 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:36,935 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,004 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,108 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@ubuntu-test-2:33391]
2019-06-25 05:44:37,115 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils - Actor system started at akka.tcp://flink@ubuntu-test-2:33391
2019-06-25 05:44:37,121 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Trying to start actor system at ubuntu-test-2:0
2019-06-25 05:44:37,138 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2019-06-25 05:44:37,144 INFO akka.remote.Remoting - Starting remoting
2019-06-25 05:44:37,152 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink-metrics@ubuntu-test-2:46253]
2019-06-25 05:44:37,153 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Actor system started at akka.tcp://flink-metrics@ubuntu-test-2:46253
2019-06-25 05:44:37,166 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2019-06-25 05:44:37,171 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Created BLOB cache storage directory /tmp/blobStore-4219e8ab-64ab-4eff-8320-8a50b550959d
2019-06-25 05:44:37,174 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /tmp/blobStore-959579c0-4892-4ba8-b7d3-63969e84f554
2019-06-25 05:44:37,175 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager with ResourceID: 3743bd08e81673b79e96d98ebab7a58a
2019-06-25 05:44:37,179 INFO org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig [server address: ubuntu-test-2/10.142.15.250, server port: 0, ssl enabled: false, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 1 (manual), number of client threads: 1 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
2019-06-25 05:44:37,224 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Temporary file directory '/tmp': total 96 GB, usable 86 GB (89.58% usable)
2019-06-25 05:44:37,305 INFO org.apache.flink.runtime.io.network.buffer.NetworkBufferPool - Allocated 102 MB for network buffer pool (number of memory segments: 3278, bytes per segment: 32768).
2019-06-25 05:44:37,354 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Client Proxy. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,355 INFO org.apache.flink.runtime.query.QueryableStateUtils - Could not load Queryable State Server. Probable reason: flink-queryable-state-runtime is not in the classpath. To enable Queryable State, please move the flink-queryable-state-runtime jar from the opt to the lib folder.
2019-06-25 05:44:37,357 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Starting the network environment and its components.
2019-06-25 05:44:37,389 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 30 ms).
2019-06-25 05:44:37,432 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 42 ms). Listening on SocketAddress /10.142.15.250:41521.
2019-06-25 05:44:37,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerServices - Limiting managed memory to 0.7 of the currently free heap space (640 MB), memory will be allocated lazily.
2019-06-25 05:44:37,436 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6 for spill files.
2019-06-25 05:44:37,496 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
2019-06-25 05:44:37,503 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
2019-06-25 05:44:37,520 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@10.142.0.40:6123/user/resourcemanager(00000000000000000000000000000000).
2019-06-25 05:44:37,521 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-504118c3-1bc2-4624-b1c4-7eacce681ba9
2019-06-25 05:44:47,542 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:07,580 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:27,620 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:45:47,660 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:07,700 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:27,741 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:46:47,780 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:07,820 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:27,860 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:47:47,900 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:07,940 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:27,980 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:48:48,020 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:08,060 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:28,100 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@10.142.0.40:6123/user/resourcemanager, retrying in 10000 ms: Ask timed out on [ActorSelection[Anchor(akka.tcp://flink@10.142.0.40:6123/), Path(/user/resourcemanager)]] after [10000 ms]. Sender[null] sent message of type "akka.actor.Identify"..
2019-06-25 05:49:37,541 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,544 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2019-06-25 05:49:37,550 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,551 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-9b6408aa-3a29-477b-8a4b-661401bad5b6
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
2019-06-25 05:49:37,554 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,555 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 0 ms).
2019-06-25 05:49:37,561 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
2019-06-25 05:49:37,562 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://flink@ubuntu-test-2:33391/user/taskmanager_0.
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,563 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-06-25 05:49:37,570 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
2019-06-25 05:49:37,576 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,577 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,580 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2019-06-25 05:49:37,584 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2019-06-25 05:49:37,596 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
2019-06-25 05:49:37,597 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
It looks like the problem was that I used IP addresses instead of hostnames. This was already pointed out in another thread on SO. When I read that thread, I assumed the reason was that a host's IP address can change over time. Apparently, using IP addresses does not work even if they never change.
I wonder why, then, the Flink documentation shows IP addresses:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/cluster_setup.html
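To spot this in an existing setup, here is a rough sketch that reads flink-conf.yaml-style text and flags a jobmanager.rpc.address that is an IP literal rather than a hostname. It rests on a few assumptions: the parser is a simplified stand-in for Flink's own configuration loading, and the helper names read_flink_conf and is_ip_literal are made up for this example.

```python
import ipaddress


def read_flink_conf(text: str) -> dict:
    """Parse 'key: value' lines; simplified stand-in for Flink's config loader."""
    conf = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values like s3://... stay intact.
        key, value = line.split(":", 1)
        conf[key.strip()] = value.strip()
    return conf


def is_ip_literal(value: str) -> bool:
    """Return True if value is an IPv4/IPv6 literal rather than a hostname."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False
```

Calling is_ip_literal on the jobmanager.rpc.address entry returned by read_flink_conf tells you whether the setting is worth switching to a hostname, as described above.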
I had the same issue.
Make sure you are using JDK 1.8, since Flink 1.7.2 needs JDK 1.8. That worked for me!
If you are running a Docker setup, check that the environment variable below is set:
FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager"
Otherwise, check the jobmanager.rpc.address setting in your configuration.
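Whichever way the address is set, it may help to confirm from the TaskManager host that the configured address resolves and that the RPC port accepts connections. A minimal sketch, assuming plain TCP reachability is a useful first signal (it does not prove that Akka registration will succeed); can_reach is a hypothetical helper name:

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if host resolves and a TCP connection to port succeeds."""
    try:
        # create_connection handles name resolution and the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers resolution failures, refused connections, and timeouts.
        return False
```

For the setup above you would call can_reach("jobmanager", 6123), using whatever host and port your jobmanager.rpc.address and jobmanager.rpc.port are set to.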

Resume Flink when yarn crashes

I am running a 3-node YARN cluster on EMR (1 master, 2 core nodes). I am using Flink 1.6.0. I have checkpointing enabled (RocksDB), writing to S3. Checkpointing seems to work correctly in other tests. When YARN crashes on the master node (in this case, I killed the YARN processes), I am unable to resume my application from the last checkpoint. Here is the output when I try to restart:
[hadoop@emr flink-1.6.0]$ bin/flink run -s s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470 -p 1 -d ~/pipeline-assembly-0.2.0.jar
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/flink-1.6.0/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2018-11-08 19:01:06,069 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,069 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Found Yarn properties file under /tmp/.yarn-properties-hadoop.
2018-11-08 19:01:06,488 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
2018-11-08 19:01:06,488 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - YARN properties set default parallelism to 1
YARN properties set default parallelism to 1
2018-11-08 19:01:06,637 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at emr:8032
2018-11-08 19:01:06,745 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,745 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
2018-11-08 19:01:06,845 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Found application JobManager host name 'emr' and port '39541' from supplied application id 'application_1541703591281_0001'
Starting execution of program
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: c701b6511ad76b5e4faae703763f388e)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
... 12 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
... 10 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:953)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 4 more
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Job submission failed.]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:310)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:294)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:952)
... 5 more
Is this expected behavior, or am I doing something wrong in this situation?
Thank you
UPDATE: jobmanager.log
LogType:jobmanager.log
Log Upload Time:Tue Nov 20 16:37:52 +0000 2018
LogLength:49255
Log Contents:
2018-11-20 16:33:33,276 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-20 16:33:33,277 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.6.0, Rev:ff472b4, Date:07.08.2018 @ 13:31:13 UTC)
2018-11-20 16:33:33,278 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: yarn
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: hadoop
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.181-b13
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 13653 MiBytes
2018-11-20 16:33:33,672 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-openjdk
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.8.3
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx15360m
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:logback.xml
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties
2018-11-20 16:33:33,673 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)
2018-11-20 16:33:33,674 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-11-20 16:33:33,675 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-11-20 16:33:33,678 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - YARN daemon is running as: hadoop Yarn client user obtainer: hadoop
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend.fs.checkpointdir, s3://bucket/kinesis-checkpoint
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: web.timeout, 60000
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.cluster-id, application_1542731534971_0001
2018-11-20 16:33:33,680 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: internal.cluster.execution-mode, NORMAL
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.fraction, 0.9
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.backend, rocksdb
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: akka.ask.timeout, 60s
2018-11-20 16:33:33,681 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.size, 20480m
2018-11-20 16:33:33,682 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 20480m
2018-11-20 16:33:33,682 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: state.checkpoints.dir, s3://bucket/kinesis-checkpoint
2018-11-20 16:33:33,695 INFO org.apache.flink.runtime.clusterframework.BootstrapTools - Setting directories for temporary files to: /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001
2018-11-20 16:33:33,708 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint.
2018-11-20 16:33:33,708 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2018-11-20 16:33:33,772 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to hadoop (auth:SIMPLE)
2018-11-20 16:33:33,786 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2018-11-20 16:33:33,791 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at ip-172-31-18-80.us-west-2.compute.internal:45751
2018-11-20 16:33:34,239 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-11-20 16:33:34,328 INFO akka.remote.Remoting - Starting remoting
2018-11-20 16:33:34,428 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751]
2018-11-20 16:33:34,437 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751
2018-11-20 16:33:34,469 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-1dc43ec8-8ed7-4342-adae-c8d20a691640
2018-11-20 16:33:34,473 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:39955 - max concurrent requests: 50 - max backlog: 1000
2018-11-20 16:33:34,488 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-11-20 16:33:34,492 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/executionGraphStore-0c4fd7ac-17d2-40d6-b279-dfef5041a76f, expiration time 3600000, maximum cache size 52428800 bytes.
2018-11-20 16:33:34,514 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /mnt/yarn/usercache/hadoop/appcache/application_1542731534971_0001/blobStore-4c662c5c-afa5-4bf2-8a01-3acc0b9aa491
2018-11-20 16:33:34,521 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-11-20 16:33:34,522 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /tmp/flink-web-6885656b-18cc-451f-8853-03ff7cf14b0e/flink-web-upload for file uploads.
2018-11-20 16:33:34,525 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint.
2018-11-20 16:33:34,702 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.log
2018-11-20 16:33:34,702 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /var/log/hadoop-yarn/containers/application_1542731534971_0001/container_1542731534971_0001_01_000001/jobmanager.out
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at ip-172-31-18-80.us-west-2.compute.internal:35939
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://ip-172-31-18-80.us-west-2.compute.internal:35939 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2018-11-20 16:33:34,844 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://ip-172-31-18-80.us-west-2.compute.internal:35939.
2018-11-20 16:33:34,857 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.yarn.YarnResourceManager at akka://flink/user/resourcemanager .
2018-11-20 16:33:34,948 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-11-20 16:33:34,981 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at ip-172-31-30-52.us-west-2.compute.internal/172.31.30.52:8030
2018-11-20 16:33:35,234 INFO org.apache.flink.yarn.YarnResourceManager - Recovered 0 containers from previous attempts ([]).
2018-11-20 16:33:35,237 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - yarn.client.max-cached-nodemanagers-proxies : 0
2018-11-20 16:33:35,238 INFO org.apache.flink.yarn.YarnResourceManager - ResourceManager akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
2018-11-20 16:33:35,239 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
2018-11-20 16:33:35,252 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@ip-172-31-18-80.us-west-2.compute.internal:45751/user/dispatcher was granted leadership with fencing token 00000000-0000-0000-0000-000000000000
2018-11-20 16:33:35,252 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
2018-11-20 16:34:20,094 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Submitting job bd0d5dbaeba3990a3bef1eebee49cd79 (Data Session Pipeline v0.0.7).
2018-11-20 16:34:20,108 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/jobmanager_0 .
2018-11-20 16:34:20,115 INFO org.apache.flink.runtime.jobmaster.JobMaster - Initializing job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,124 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=2147483647, delayBetweenRestartAttempts=0) for Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,127 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.slotpool.SlotPool at akka://flink/user/0e6f5de3-53ad-4bae-acf3-3c66106c0a54 .
2018-11-20 16:34:20,148 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
2018-11-20 16:34:20,170 INFO org.apache.flink.runtime.jobmaster.JobMaster - Running initialization on master for job Data Session Pipeline v0.0.7 (bd0d5dbaeba3990a3bef1eebee49cd79).
2018-11-20 16:34:20,170 INFO org.apache.flink.runtime.jobmaster.JobMaster - Successfully ran initialization on master in 0 ms.
2018-11-20 16:34:20,203 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using application-defined state backend: RocksDBStateBackend{checkpointStreamBackend=File State Backend (checkpoints: 's3://bucket/kinesis-checkpoint', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1), localRocksDbDirectories=null, enableIncrementalCheckpointing=TRUE}
2018-11-20 16:34:20,203 INFO org.apache.flink.runtime.jobmaster.JobMaster - Configuring application-defined state backend with job/cluster config
2018-11-20 16:34:22,624 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Starting job bd0d5dbaeba3990a3bef1eebee49cd79 from savepoint s3://bucket/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80 ()
2018-11-20 16:34:22,663 ERROR org.apache.flink.runtime.rest.handler.job.JobSubmitHandler - Exception occurred in REST handler.
org.apache.flink.runtime.rest.handler.RestHandlerException: Job submission failed.
at org.apache.flink.runtime.rest.handler.job.JobSubmitHandler.lambda$handleRequest$2(JobSubmitHandler.java:119)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
at akka.dispatch.OnComplete.internal(Future.scala:258)
at akka.dispatch.OnComplete.internal(Future.scala:256)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:20)
at akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:18)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
at scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$submitJob$2(Dispatcher.java:256)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:690)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
... 4 more
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit job.
... 24 more
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:708)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
... 18 more
Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
... 19 more
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at org.apache.flink.runtime.dispatcher.Dispatcher.persistAndRunJob(Dispatcher.java:266)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
... 21 more
Caused by: java.io.FileNotFoundException: Cannot find meta data file '_metadata' in directory 's3://sledfs/kinesis-pipeline-checkpoint/8a6e5aeebeef202a2daddd3cf9419a80'. Please try to load the checkpoint/savepoint directly from the metadata file instead of the directory.
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpointPointer(AbstractFsCheckpointStorage.java:256)
at org.apache.flink.runtime.state.filesystem.AbstractFsCheckpointStorage.resolveCheckpoint(AbstractFsCheckpointStorage.java:109)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1102)
at org.apache.flink.runtime.jobmaster.JobMaster.tryRestoreExecutionGraphFromSavepoint(JobMaster.java:1220)
at org.apache.flink.runtime.jobmaster.JobMaster.createAndRestoreExecutionGraph(JobMaster.java:1144)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:295)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:157)
... 26 more
2018-11-20 16:37:52,321 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2018-11-20 16:37:52,322 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2018-11-20 16:37:52,340 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:39955
The checkpoint you are referring to, s3://bucket/kinesis-pipeline-checkpoint/a8a9ceb95845c3ea9833e025b5771470, does not contain a valid _metadata file. This indicates that the checkpoint was started but never completed. Please choose a checkpoint that completed successfully.
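One way to follow this advice is to check which checkpoint or savepoint directories actually contain a `_metadata` object before passing one to `flink run -s`. The sketch below is a minimal illustration, assuming you already have a flat listing of object keys under the checkpoint prefix (for example from `aws s3 ls --recursive`); the function name and the sample keys are hypothetical.

```python
from pathlib import PurePosixPath


def completed_checkpoints(keys):
    """Return the directories that contain a `_metadata` object, sorted."""
    return sorted(
        str(PurePosixPath(key).parent)
        for key in keys
        if PurePosixPath(key).name == "_metadata"
    )


# Hypothetical listing: the first directory was finalized, the second was not.
sample_keys = [
    "kinesis-pipeline-checkpoint/chk-41/_metadata",
    "kinesis-pipeline-checkpoint/chk-41/state-file-1",
    "kinesis-pipeline-checkpoint/chk-42/state-file-1",
]

restorable = completed_checkpoints(sample_keys)
print(restorable)
```

Only directories returned by this check are candidates for a restore; a directory without `_metadata` would fail with the same FileNotFoundException shown in the log.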

Flink job fails after 10 minutes from initialization

I'm having a problem with a Flink application that keeps failing.
The streaming job starts running shortly after it is deployed on YARN.
But it fails after some minutes with the error messages below.
Could this be evidence of high load on a low-performance YARN cluster?
Flink 1.5.0 on YARN, running a single job
Each node has 100 GB of RAM and 40 vcores
48 YARN NodeManagers
2 Kafka input topics (150 GB/hour for each input stream)
480 Kafka partitions
10 Flink slots per NodeManager
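As a quick sanity check on the sizing above, the arithmetic can be written out. This is only a sketch using the figures listed in the question; the variable names are made up, and the input volume is read as 150 GB per hour for each of the two topics.

```python
# Figures taken from the setup described above.
node_managers = 48
slots_per_node_manager = 10
kafka_partitions = 480
input_topics = 2
gb_per_hour_per_topic = 150

# Total task slots the YARN cluster can offer to the job.
total_slots = node_managers * slots_per_node_manager

# Combined input volume, and its average rate in MB/s (1 GB taken as 1000 MB).
total_gb_per_hour = input_topics * gb_per_hour_per_topic
avg_mb_per_second = total_gb_per_hour * 1000 / 3600

print(f"total slots: {total_slots}, Kafka partitions: {kafka_partitions}")
print(f"input: {total_gb_per_hour} GB/hour, about {avg_mb_per_second:.1f} MB/s on average")
```

With these numbers the slot count equals the partition count, so a job running at parallelism 480 uses every slot that the 48 NodeManagers provide at 10 slots each.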
From the beginning of the Flink log:
Log Type: jobmanager.log
Log Upload Time: Tue Jun 12 18:19:50 +0900 2018
Log Length: 10807897
2018-06-11 18:59:27,167 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-06-11 18:59:27,168 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.5.0, Rev:c61b108, Date:24.05.2018 # 14:54:44 UTC)
2018-06-11 18:59:27,168 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - OS current user: irteam
2018-06-11 18:59:27,472 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-06-11 18:59:27,536 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Current Hadoop/Kerberos user: irteam
2018-06-11 18:59:27,536 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.161-b14
2018-06-11 18:59:27,536 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Maximum heap size: 66667 MiBytes
2018-06-11 18:59:27,537 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JAVA_HOME: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64
2018-06-11 18:59:27,537 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Hadoop version: 2.8.3
2018-06-11 18:59:27,537 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - JVM Options:
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Xmx75000m
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djava.library.path=/home1/irteam/realtime-tools
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog.file=/naver/search-cluster/eye/var/logs/application_1528711080009_0002/container_e08_1528711080009_0002_01_000001/jobmanager.log
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlogback.configurationFile=file:logback.xml
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Dlog4j.configuration=file:log4j.properties
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Program Arguments: (none)
2018-06-11 18:59:27,538 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Class path[omit]
2018-06-11 18:59:27,539 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
2018-06-11 18:59:27,539 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Registered UNIX signal handlers for [TERM, HUP, INT]
2018-06-11 18:59:27,542 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - YARN daemon is running as: irteam Yarn client user obtainer: irteam
2018-06-11 18:59:27,544 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.home, "/usr/lib/jvm/java-1.8.0-openjdk"
2018-06-11 18:59:27,544 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: env.java.opts, "-Djava.library.path=/home1/irteam/realtime-tools"
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: high-availability.cluster-id, application_1528711080009_0002
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, 0.0.0.0
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 100000
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.network.request-backoff.max, 100000
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: containerized.taskmanager.env.JAVA_HOME, /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: internal.cluster.execution-mode, NORMAL
2018-06-11 18:59:27,545 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 480
2018-06-11 18:59:27,546 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 10
2018-06-11 18:59:27,546 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 100000
2018-06-11 18:59:27,546 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: containerized.master.env.JAVA_HOME, /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.161-0.b14.el7_4.x86_64
2018-06-11 18:59:27,558 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Setting directories for temporary files to: /home1/irteam/naver/search-cluster/eye/volume/nodemanager/usercache/irteam/appcache/application_1528711080009_0002
2018-06-11 18:59:27,570 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint.
2018-06-11 18:59:27,570 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Install default filesystem.
2018-06-11 18:59:27,636 INFO org.apache.flink.runtime.security.modules.HadoopModule - Hadoop user set to irteam (auth:SIMPLE)
2018-06-11 18:59:27,650 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing cluster services.
2018-06-11 18:59:27,654 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Trying to start actor system at chd004.eye.nfra.io:33524
2018-06-11 18:59:28,126 INFO akka.event.slf4j.Slf4jLogger - Slf4jLogger started
2018-06-11 18:59:28,222 INFO akka.remote.Remoting - Starting remoting
2018-06-11 18:59:28,322 INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink#chd004.eye.nfra.io:33524]
2018-06-11 18:59:28,329 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Actor system started at akka.tcp://flink#chd004.eye.nfra.io:33524
2018-06-11 18:59:28,348 INFO org.apache.flink.runtime.blob.BlobServer - Created BLOB server storage directory /home1/irteam/naver/search-cluster/eye/volume/nodemanager/usercache/irteam/appcache/application_1528711080009_0002/blobStore-c25d4d9d-4ddc-442d-8d5e-7bec36dca006
2018-06-11 18:59:28,349 INFO org.apache.flink.runtime.blob.BlobServer - Started BLOB server at 0.0.0.0:45733 - max concurrent requests: 50 - max backlog: 1000
2018-06-11 18:59:28,363 INFO org.apache.flink.runtime.metrics.MetricRegistryImpl - No metrics reporter configured, no metrics will be exposed/reported.
2018-06-11 18:59:28,367 INFO org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore - Initializing FileArchivedExecutionGraphStore: Storage directory /home1/irteam/naver/search-cluster/eye/volume/nodemanager/usercache/irteam/appcache/application_1528711080009_0002/executionGraphStore-63bcf196-410d-4d8c-8388-f270beb53555, expiration time 3600000, maximum cache size 52428800 bytes.
2018-06-11 18:59:28,388 INFO org.apache.flink.runtime.blob.TransientBlobCache - Created BLOB cache storage directory /home1/irteam/naver/search-cluster/eye/volume/nodemanager/usercache/irteam/appcache/application_1528711080009_0002/blobStore-02db740f-8c23-46e8-bb24-1f583b6a0b33
2018-06-11 18:59:28,395 WARN org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Upload directory /tmp/flink-web-8698d702-67fe-437c-b62e-78c2969bf770/flink-web-upload does not exist, or has been deleted externally. Previously uploaded files are no longer available.
2018-06-11 18:59:28,396 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Created directory /tmp/flink-web-8698d702-67fe-437c-b62e-78c2969bf770/flink-web-upload for file uploads.
2018-06-11 18:59:28,399 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Starting rest endpoint.
2018-06-11 18:59:28,737 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /naver/search-cluster/eye/var/logs/application_1528711080009_0002/container_e08_1528711080009_0002_01_000001/jobmanager.log
2018-06-11 18:59:28,737 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /naver/search-cluster/eye/var/logs/application_1528711080009_0002/container_e08_1528711080009_0002_01_000001/jobmanager.out
2018-06-11 18:59:28,808 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at chd004.eye.nfra.io:39794
2018-06-11 18:59:28,808 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://chd004.eye.nfra.io:39794 was granted leadership with leaderSessionID=00000000-0000-0000-0000-000000000000
2018-06-11 18:59:28,808 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://chd004.eye.nfra.io:39794.
2018-06-11 18:59:28,817 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.yarn.YarnResourceManager at akka://flink/user/resourcemanager .
2018-06-11 18:59:28,902 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
2018-06-11 18:59:28,916 INFO org.apache.flink.yarn.YarnResourceManager - ResourceManager akka.tcp://flink#chd004.eye.nfra.io:33524/user/resourcemanager was granted leadership with fencing token 00000000000000000000000000000000
2018-06-11 18:59:28,917 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
2018-06-11 18:59:29,161 INFO org.apache.flink.yarn.YarnResourceManager - Recovered 0 containers from previous attempts ([]).
2018-06-11 18:59:29,163 INFO org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy - yarn.client.max-cached-nodemanagers-proxies : 0
2018-06-11 18:59:29,174 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink#chd004.eye.nfra.io:33524/user/dispatcher was granted leadership with fencing token 00000000000000000000000000000000
2018-06-11 18:59:29,174 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
2018-06-11 18:59:31,120 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Submitting job 5f090c4f4287db062cee0996da5d5ffc (LCS realtime data).
2018-06-11 18:59:31,130 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/jobmanager_0 .
2018-06-11 18:59:31,136 INFO org.apache.flink.runtime.jobmaster.JobMaster - Initializing job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc).
2018-06-11 18:59:31,144 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using restart strategy FixedDelayRestartStrategy(maxNumberRestartAttempts=3, delayBetweenRestartAttempts=30000) for LCS realtime data (5f090c4f4287db062cee0996da5d5ffc).
2018-06-11 18:59:31,148 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.slotpool.SlotPool at akka://flink/user/a6ffe322-07db-4282-a29c-0836ad26cd9f .
2018-06-11 18:59:31,165 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job recovers via failover strategy: full graph restart
2018-06-11 18:59:31,174 INFO org.apache.flink.runtime.jobmaster.JobMaster - Running initialization on master for job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc).
2018-06-11 18:59:31,174 INFO org.apache.flink.runtime.jobmaster.JobMaster - Successfully ran initialization on master in 0 ms.
2018-06-11 18:59:31,248 INFO org.apache.flink.runtime.jobmaster.JobMaster - Using application-defined state backend: File State Backend (checkpoints: 'file:/home1/irteam/apps/flink-1.4.0/checkpoint', savepoints: 'null', asynchronous: UNDEFINED, fileStateThreshold: -1)
2018-06-11 18:59:31,248 INFO org.apache.flink.runtime.jobmaster.JobMaster - Configuring application-defined state backend with job/cluster config
2018-06-11 18:59:31,258 INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager runner for job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc) was granted leadership with session id 00000000-0000-0000-0000-000000000000 at akka.tcp://flink#chd004.eye.nfra.io:33524/user/jobmanager_0.
2018-06-11 18:59:31,260 INFO org.apache.flink.runtime.jobmaster.JobMaster - Starting execution of job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc)
2018-06-11 18:59:31,261 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc) switched from state CREATED to RUNNING.
2018-06-11 18:59:31,264 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/480) (98a01166bb2ac99dd301e4b60febbc45) switched from CREATED to SCHEDULED.
Log near the timeout event that might be causing the Flink job to fail:
2018-06-12 18:17:39,750 INFO org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator - Cancelling sample 5589
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#chd023.eye.nfra.io:34783/user/taskmanager_0#-297572584]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)
2018-06-12 18:17:39,770 INFO org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator - Cancelling sample 5590
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#chd032.eye.nfra.io:34653/user/taskmanager_0#424015125]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)
2018-06-12 18:17:51,270 INFO org.apache.flink.runtime.rest.handler.legacy.backpressure.StackTraceSampleCoordinator - Cancelling sample 5591
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#chd032.eye.nfra.io:34653/user/taskmanager_0#424015125]] after [15000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation".
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
at java.lang.Thread.run(Thread.java:748)
2018-06-12 18:17:55,650 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of TaskManager with id container_e08_1528711080009_0002_01_000017 timed out.
2018-06-12 18:17:55,650 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e08_1528711080009_0002_01_000017 because: The heartbeat of TaskManager with id container_e08_1528711080009_0002_01_000017 timed out.
2018-06-12 18:17:55,650 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Unregister TaskManager 525095d833344e8b205017666accd9c5 from the SlotManager.
2018-06-12 18:17:55,650 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Window(EventTimeSessionWindows(300000), NowTrigger, NowSessionProcessor) -> Sink: Unnamed (188/480) (f9ed2fc23d6ca5a364300864b60760af) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: Releasing TaskManager container_e08_1528711080009_0002_01_000017.
at org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManagerInternal(SlotPool.java:1067)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManager(SlotPool.java:1050)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-06-12 18:17:55,651 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job LCS realtime data (5f090c4f4287db062cee0996da5d5ffc) switched from state RUNNING to FAILING.
org.apache.flink.util.FlinkException: Releasing TaskManager container_e08_1528711080009_0002_01_000017.
at org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManagerInternal(SlotPool.java:1067)
at org.apache.flink.runtime.jobmaster.slotpool.SlotPool.releaseTaskManager(SlotPool.java:1050)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:247)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:162)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-06-12 18:17:55,679 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Source: Custom Source (1/480) (98a01166bb2ac99dd301e4b60febbc45) switched from RUNNING to CANCELING.
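To see which TaskManagers dropped out, the heartbeat-timeout lines in a log like the one above can be pulled out with a small script. This is a rough sketch, not a Flink tool: the function name is hypothetical, and the regular expression simply matches the message text seen in this log.

```python
import re

# Matches the YarnResourceManager message shown in the log above.
HEARTBEAT_TIMEOUT = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*"
    r"The heartbeat of TaskManager with id (?P<container>\S+) timed out\."
)


def timed_out_task_managers(log_lines):
    """Return (timestamp, container id) pairs, one per distinct timed-out TaskManager."""
    seen = {}
    for line in log_lines:
        match = HEARTBEAT_TIMEOUT.search(line)
        if match and match.group("container") not in seen:
            seen[match.group("container")] = match.group("ts")
    return [(ts, container) for container, ts in seen.items()]


# Two lines copied from the log above; both refer to the same container.
sample = [
    "2018-06-12 18:17:55,650 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of TaskManager with id container_e08_1528711080009_0002_01_000017 timed out.",
    "2018-06-12 18:17:55,650 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e08_1528711080009_0002_01_000017 because: The heartbeat of TaskManager with id container_e08_1528711080009_0002_01_000017 timed out.",
]
timeouts = timed_out_task_managers(sample)
print(timeouts)
```

The container ID it reports can then be used to look up that TaskManager's own log on the YARN side.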

How to run flink scala shell in yarn mode

I am trying to launch the Flink Scala shell in YARN mode, but I hit the following error.
This is the command I use. Am I missing anything? Thanks.
bin/start-scala-shell.sh yarn -n 2
Starting Flink Shell:
2018-06-04 17:31:18,166 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, localhost
2018-06-04 17:31:18,168 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2018-06-04 17:31:18,168 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.mb, 1024
2018-06-04 17:31:18,168 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.heap.mb, 1024
2018-06-04 17:31:18,169 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
2018-06-04 17:31:18,169 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2018-06-04 17:31:18,169 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: rest.port, 8081
Exception in thread "main" java.lang.UnsupportedOperationException: Can't deploy a standalone cluster.
at org.apache.flink.client.deployment.StandaloneClusterDescriptor.deploySessionCluster(StandaloneClusterDescriptor.java:57)
at org.apache.flink.client.deployment.StandaloneClusterDescriptor.deploySessionCluster(StandaloneClusterDescriptor.java:31)
at org.apache.flink.api.scala.FlinkShell$.deployNewYarnCluster(FlinkShell.scala:272)
at org.apache.flink.api.scala.FlinkShell$.fetchConnectionInfo(FlinkShell.scala:164)
at org.apache.flink.api.scala.FlinkShell$.liftedTree1$1(FlinkShell.scala:194)
at org.apache.flink.api.scala.FlinkShell$.startShell(FlinkShell.scala:193)
at org.apache.flink.api.scala.FlinkShell$.main(FlinkShell.scala:135)
at org.apache.flink.api.scala.FlinkShell.main(FlinkShell.scala)
Which version of Flink are you using? If it is 1.5.0, there is a known issue: the Scala shell does not work with FLIP-6 mode, which is enabled by default. You can try running it in legacy mode. There is already an open JIRA issue, FLINK-8795, for fixing it.
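For reference, in Flink 1.5 the legacy (pre-FLIP-6) mode is selected through the `mode` key in conf/flink-conf.yaml. This is a minimal sketch, assuming a 1.5.x distribution; whether it resolves the Scala shell failure on a particular setup is not confirmed here.

```yaml
# conf/flink-conf.yaml (assumes Flink 1.5.x)
# Switch from the default FLIP-6 mode ("new") back to the legacy mode.
mode: legacy
```

After changing the setting, rerun the same bin/start-scala-shell.sh yarn -n 2 command.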

Loading solr configs in Cloudera SolrCloud

We are trying to import our data into SolrCloud using MapReduce batch indexing. We hit a problem in the reduce phase: solr.xml cannot be found. We created a 'twitter' collection, but according to the logs, after failing to load solr.xml, Solr falls back to the default configuration and tries to create the SolrCores 'collection1' (which fails) and 'core1' (which succeeds). I'm not sure whether we need to create our own solr.xml, or where to put it (we tried several locations, but it never seems to be loaded). Below is the log:
2022 [main] INFO org.apache.solr.hadoop.HeartBeater - Heart beat reporting class is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
2025 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Using this unpacked directory as solr home: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip
2025 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Creating embedded Solr server with solrHomeDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip, fs: DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_-1828461666_1, ugi=nguyen (auth:SIMPLE)]], outputShardDir: hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014
2029 [Thread-64] INFO org.apache.solr.hadoop.HeartBeater - HeartBeat thread running
2030 [Thread-64] INFO org.apache.solr.hadoop.HeartBeater - Issuing heart beat for 1 threads
2083 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/'
2259 [main] INFO org.apache.solr.hadoop.SolrRecordWriter - Constructed instance information solr.home /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip (/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip), instance dir /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/, conf dir /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/conf/, writing index to solr.data.dir hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data, with permdir hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014
2266 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/solr.xml
2267 [main] INFO org.apache.solr.core.ConfigSolr - /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/solr.xml does not exist, using default configuration
2505 [main] INFO org.apache.solr.core.CoreContainer - New CoreContainer 696103669
2505 [main] INFO org.apache.solr.core.CoreContainer - Loading cores into CoreContainer [instanceDir=/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/]
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting socketTimeout to: 0
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting urlScheme to: http://
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting connTimeout to: 0
2515 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxConnectionsPerHost to: 20
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting corePoolSize to: 0
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maximumPoolSize to: 2147483647
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting maxThreadIdleTime to: 5
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting sizeOfQueue to: -1
2516 [main] INFO org.apache.solr.handler.component.HttpShardHandlerFactory - Setting fairnessPolicy to: false
2527 [main] INFO org.apache.solr.client.solrj.impl.HttpClientUtil - Creating new http client, config:maxConnectionsPerHost=20&maxConnections=10000&socketTimeout=0&connTimeout=0&retry=false
2648 [main] INFO org.apache.solr.logging.LogWatcher - Registering Log Listener [Log4j (org.slf4j.impl.Log4jLoggerFactory)]
2676 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.CoreContainer - Creating SolrCore 'collection1' using instanceDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1
2677 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/'
2691 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - Failed to load file /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/solrconfig.xml
2693 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - Unable to create core: collection1
org.apache.solr.common.SolrException: Could not load config for solrconfig.xml
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:596)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:661)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:368)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:360)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/conf/', cwd=/data/05/mapred/local/taskTracker/nguyen/jobcache/job_201311191613_0320/attempt_201311191613_0320_r_000014_0/work
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:322)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:287)
at org.apache.solr.core.Config.<init>(Config.java:116)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:593)
... 11 more
2695 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.SolrException: Unable to create core: collection1
at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:1158)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:670)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:368)
at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:360)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
Caused by: org.apache.solr.common.SolrException: Could not load config for solrconfig.xml
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:596)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:661)
... 10 more
Caused by: java.io.IOException: Can't find resource 'solrconfig.xml' in classpath or '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/conf/', cwd=/data/05/mapred/local/taskTracker/nguyen/jobcache/job_201311191613_0320/attempt_201311191613_0320_r_000014_0/work
at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:322)
at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:287)
at org.apache.solr.core.Config.<init>(Config.java:116)
at org.apache.solr.core.Config.<init>(Config.java:86)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:120)
at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:593)
... 11 more
2697 [main] INFO org.apache.solr.core.CoreContainer - Creating SolrCore 'core1' using instanceDir: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip
2697 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: '/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/'
2751 [main] INFO org.apache.solr.core.SolrConfig - Adding specified lib dirs to ClassLoader
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/extraction/lib (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/extraction/lib).
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2752 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/clustering/lib/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/clustering/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/langid/lib/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/langid/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../contrib/velocity/lib (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../contrib/velocity/lib).
2753 [main] WARN org.apache.solr.core.SolrResourceLoader - Can't find (or read) directory to add to classloader: ../../../dist/ (resolved as: /data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/../../../dist).
2785 [main] INFO org.apache.solr.update.SolrIndexConfig - IndexWriter infoStream solr logging is enabled
2790 [main] INFO org.apache.solr.core.SolrConfig - Using Lucene MatchVersion: LUCENE_44
2869 [main] INFO org.apache.solr.core.Config - Loaded SolrConfig: solrconfig.xml
2879 [main] INFO org.apache.solr.schema.IndexSchema - Reading Solr Schema from schema.xml
2937 [main] INFO org.apache.solr.schema.IndexSchema - [core1] Schema name=twitter
3352 [main] INFO org.apache.solr.schema.IndexSchema - unique key field: id
3471 [main] INFO org.apache.solr.schema.FileExchangeRateProvider - Reloading exchange rates from file currency.xml
3478 [main] INFO org.apache.solr.schema.FileExchangeRateProvider - Reloading exchange rates from file currency.xml
3635 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Solr Kerberos Authentication disabled
3636 [main] INFO org.apache.solr.core.JmxMonitoredMap - No JMX servers found, not exposing Solr information with JMX.
3652 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - creating directory factory for path hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data
3686 [main] INFO org.apache.solr.core.CachingDirectoryFactory - return new directory for hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data
3711 [main] WARN org.apache.solr.core.SolrCore - [core1] Solr index directory 'hdfs:/master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index' doesn't exist. Creating new index...
3719 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - creating directory factory for path hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index
3719 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Number of slabs of block cache [1] with direct memory allocation set to [true]
3720 [main] INFO org.apache.solr.core.HdfsDirectoryFactory - Block cache target memory usage, slab size of [134217728] will allocate [1] slabs and use ~[134217728] bytes
3721 [main] INFO org.apache.solr.store.blockcache.BufferStore - Initializing the 1024 buffers with [8192] buffers.
3740 [main] INFO org.apache.solr.store.blockcache.BufferStore - Initializing the 8192 buffers with [8192] buffers.
3891 [main] INFO org.apache.solr.core.CachingDirectoryFactory - return new directory for hdfs://master.hadoop:8020/user/nguyen/twitter/outdir/reducers/_temporary/_attempt_201311191613_0320_r_000014_0/part-r-00014/data/index
3988 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: init: current segments file is "null"; deletionPolicy=org.apache.solr.core.IndexDeletionPolicyWrapper#65b01d5d
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: now checkpoint "" [0 segments ; isCommit = false]
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IFD][main]: 0 msec to checkpoint
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]: init: create=true
3992 [main] INFO org.apache.solr.update.LoggingInfoStream - [IW][main]:
dir=NRTCachingDirectory(org.apache.solr.store.hdfs.HdfsDirectory#17e5a6d8 lockFactory=org.apache.solr.store.hdfs.HdfsLockFactory#7f117668; maxCacheMB=192.0 maxMergeSizeMB=16.0)
Solr reads the solr.home parameter and searches for the solrconfig.xml file there. If it finds none, it tries to load the default configuration.
It looks like your Solr home is
/data/06/mapred/local/taskTracker/distcache/3866561797898787678_-1754062477_512745567/master.hadoop/tmp/9501daf9-5011-4665-bae3-d5af1c8bcd62.solr.zip/collection1/
Check that folder for a solrconfig.xml file.
If there is none, copy one from the example directory of Solr.
If there is one, make sure the file and folder permissions match those of the server instance.
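The checks above can be sketched as a small script run on the node that holds the unpacked Solr home. This is only an illustration: the function name `missing_solr_files` is hypothetical, and the expected layout (solr.xml at the top level, solrconfig.xml and schema.xml under conf/) is an assumption based on the "conf dir" path printed in the log, not something confirmed for every Solr version.

```python
import os
from pathlib import Path

# Assumed layout, inferred from the "conf dir" path in the log above.
EXPECTED_FILES = ("solr.xml", "conf/solrconfig.xml", "conf/schema.xml")


def missing_solr_files(solr_home):
    """Return the expected files that are absent or unreadable under solr_home."""
    home = Path(solr_home)
    missing = []
    for rel in EXPECTED_FILES:
        candidate = home / rel
        # A file counts as present only if it exists and the current user can read it.
        if not (candidate.is_file() and os.access(candidate, os.R_OK)):
            missing.append(rel)
    return missing
```

Calling `missing_solr_files` with the Solr home path from the log, as the same user that runs the reduce task, returns the relative paths that still need to be copied in or have their permissions fixed.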

Resources