The taskmanager could not terminating and remained in a hung state, so the systemd considered that it was running and did not restart it
{"time":"2022-05-23 12:46:42.189","loglevel":"ERROR","class":"org.apache.flink.runtime.taskexecutor.TaskManagerRunner","message":"Fatal error occurred while executing the TaskManager. Shutting it down...","host":"flink15"}
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 764 + seconds.
at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1791)
at java.lang.Thread.run(Thread.java:750)
{"time":"2022-05-23 12:46:52.201","loglevel":"ERROR","class":"org.apache.flink.runtime.taskexecutor.TaskManagerRunner","message":"Terminating TaskManagerRunner with exit code 1.","host":"flink15"}
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:394)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:428)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:428)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:408)
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:366)
Caused by: java.util.concurrent.TimeoutException: null
at org.apache.flink.util.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1237)
at org.apache.flink.util.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
at org.apache.flink.util.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:591)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
{"time":"2022-05-23 12:46:52.203","loglevel":"INFO","class":"org.apache.flink.runtime.blob.PermanentBlobCache","message":"Shutting down BLOB cache","host":"flink15"}
{"time":"2022-05-23 12:46:52.204","loglevel":"INFO","class":"org.apache.flink.runtime.blob.TransientBlobCache","message":"Shutting down BLOB cache","host":"flink15"}
{"time":"2022-05-23 12:46:52.205","loglevel":"INFO","class":"org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager","message":"Shutting down TaskExecutorLocalStateStoresManager.","host":"flink15"}
{"time":"2022-05-23 12:46:52.205","loglevel":"INFO","class":"org.apache.flink.runtime.filecache.FileCache","message":"removed file cache directory /var/lib/flink/tmp/flink-dist-cache-436fc230-8c11-45fd-83b2-43f0105dd524","host":"flink15"}
{"time":"2022-05-23 12:46:52.206","loglevel":"INFO","class":"org.apache.flink.runtime.io.disk.FileChannelManagerImpl","message":"FileChannelManager removed spill file directory /var/lib/flink/tmp/flink-io-f2b13865-538e-4369-9143-186d783aca9a","host":"flink15"}
{"time":"2022-05-23 12:46:52.206","loglevel":"INFO","class":"org.apache.flink.runtime.io.disk.FileChannelManagerImpl","message":"FileChannelManager removed spill file directory /var/lib/flink/tmp/flink-netty-shuffle-de83c8a0-9806-4a4e-b40e-b9b481a31f40","host":"flink15"}
but java process continued to run on the system, and when we manually restarted taskmanager systemd forced it to kill
May 23 15:20:48 flink15 systemd[1]: Stopping Apache Flink Taskmanager...
May 23 15:20:48 flink15 taskmanager.sh[10486]: Stopping taskexecutor daemon (pid: 60193) on host flink15.
May 23 15:20:58 flink15 taskmanager.sh[10486]: Daemon taskexecutor didn't stop within 10 seconds. Killing it.
May 23 15:21:01 flink15 systemd[1]: flink-taskmanager.service: Main process exited, code=killed, status=9/KILL
May 23 15:21:01 flink15 systemd[1]: Stopped Apache Flink Taskmanager.
May 23 15:21:01 flink15 systemd[1]: flink-taskmanager.service: Unit entered failed state.
is there any way to avoid this problem? so that taskmanager is guaranteed to exit in case of errors?
Related
Flink job submission
$ ./bin/flink run -m 10.0.2.4:6123 /streaming/mvn-flinkstreaming-scala/mvn-flinkstreaming-scala-1.0.jar
Stream processing!!!!!!!!!!!!!!!!!
org.apache.flink.streaming.api.datastream.DataStreamSink#40ef3420
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
but when i checked the logs of the job in UI getting a different error,
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request ea
What should I check, my configuration parameter are as below,
A) is the -m ip_address:6123 right option or 8081 should be the port...
config..
# Note this accounts for all memory usage within the TaskManager process, including JVM metaspace and other overhead.
taskmanager.memory.process.size: 1568m
# To exclude JVM metaspace and overhead, please, use total Flink memory size instead of 'taskmanager.memory.process.size'.
# It is not recommended to set both 'taskmanager.memory.process.size' and Flink memory.
#
# taskmanager.memory.flink.size: 1280m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 2
# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 2
Cluster starting,
$ bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host centos1.
Starting taskexecutor daemon on host centos2.
Starting taskexecutor daemon on host centos3.
Able to process grep in master node,
]$ ps -ef | grep flink
root 12300 1 10 07:22 pts/0 00:00:05 java -Xms16384m -Xmx16384m -Dlog.file=/storage/flink-1.10.0/log/
Unable to find a process of fink related to task managers,
centos2 ~]$ psg flink
Is this a right state?
I have faced the same problem in Flink-1.10.0. So please make sure you have enough memory according to the data load.
Error which i was getting:
java.lang.OutOfMemoryError: Metaspace
jobmanager.heap.size: 1024m (Default)
taskmanager.memory.flink.size: 1280m (Default)
taskmanager.memory.jvm-metaspace.size: 256m (Default)
So I have increased taskmanager.memory.jvm-metaspace.size according to data load and it solved my issue.
For more details click here.
I've encountered this problem before, which is most likely an indicator for insufficient memory on your flink cluster. The different error messages also make sense, as they relate to each other.
Check your
jobmanager.heap.size
and
taskmanager.heap.size
in the config, increase them to a rather oversized amount and you should not see this error anymore. From here you can finetune the actual memory settings
I am trying to setup Apache Flink standalone cluster consisting of 2 master nodes and one worker node. Using Flink 1.6 and Zookeeper. To start and stop cluster I used process described in Flink's 1.6 documentation, i.e. to start cluster I ran start-zookeeper-quorum.sh and then start-cluster.sh
and to stop cluster I ran stop-cluster.sh
After running one job (which failed), then stopping and restarting cluster again I noticed error where none of 2 the job managers could start because they are looking for directory job_e44fdee88a931200953fed45883ee3f1 which does not exist (I am assuming this is directory for my failed job, but not sure)
How do I recover cluster from this error?
2018-09-06 14:58:04,065 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:40)
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$waitForTerminatingJobManager$29(Dispatcher.java:820)
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:705)
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:687)
at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set up JobManager
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:176)
at org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
:
... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: /hastorage/default/blob/job_e44fdee88a931200953fed45883ee3f1/blob_p-f655414c973995e93709acbd22c1c162c9c43a98-75bd4e71882f988a6c337222efadba7b (No such file or directory)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:134)
... 25 more
Caused by: java.io.FileNotFoundException: /hastorage/default/blob/job_e44fdee88a931200953fed45883ee3f1/blob_p-f655414c973995e93709acbd22c1c162c9c43a98-75bd4e71882f988a6c337222efadba7b (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:142)
at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:102)
at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:84)
at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:493)
at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:444)
at org.apache.flink.runtime.blob.BlobServer.getFile(BlobServer.java:417)
at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerJob(BlobLibraryCacheManager.java:91)
at org.apache.flink.runtime.jobmaster.JobManagerRunner.<init>(JobManagerRunner.java:131)
... 25 more
2018-09-06 14:58:04,069 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
The problem you are observing is caused by a bug in Flink. You can find more details about the problem here. The problem will be fixed with the next bug fix release.
I'm running Apache Flink 1.4.2 (Without bundled hadoop) on Mac, and I've managed to run the WordCount example. However, when I try to run the Gelly example following instructions in Runnung Gelly Examples , I stumble into the following error:
java.lang.ClassNotFoundException: org.apache.flink.graph.generator.random.BlockInfo
May I know how can I fix it?
I first started the cluster by
./bin/start-cluster.sh
Starting cluster.
[INFO] 1 instance(s) of jobmanager are already running on myhostname.
Starting jobmanager daemon on host myhostname.
[INFO] 1 instance(s) of taskmanager are already running on myhostname.
Starting taskmanager daemon on host myhostname.
Then in another terminal, I run
./bin/flink run examples/gelly/flink-gelly-examples_2.11-1.4.2.jar --algorithm GraphMetrics --order directed --input RMatGraph --type integer --scale 20 --simplify directed --output print
and the following errors
Cluster configuration: Standalone cluster with JobManager at localhost/127.0.0.1:6123
Using address localhost:6123 to connect to JobManager.
JobManager web interface address http://localhost:8081
Starting execution of program
Submitting job with JobID: 099acb25fa34be1fe63fb47296605f69. Waiting for job completion.
Connected to JobManager at Actor[akka.tcp://flink#localhost:6123/user/jobmanager#57507119] with leader session id 00000000-0000-0000-0000-000000000000.
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Failed to submit job 099acb25fa34be1fe63fb47296605f69 (RMatGraph (s20e16d) ⇨ GraphMetrics ⇨ Hash [integer])
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:444)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at org.apache.flink.graph.Runner.execute(Runner.java:452)
at org.apache.flink.graph.Runner.main(Runner.java:507)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:525)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:417)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:396)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:802)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:282)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1054)
at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1101)
at org.apache.flink.client.CliFrontend$1.call(CliFrontend.java:1098)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1098)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Failed to submit job 099acb25fa34be1fe63fb47296605f69 (RMatGraph (s20e16d) ⇨ GraphMetrics ⇨ Hash [integer])
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1325)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1.applyOrElse(JobManager.scala:447)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:38)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:122)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.ClassNotFoundException: org.apache.flink.graph.generator.random.BlockInfo
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader.loadClass(FlinkUserCodeClassLoaders.java:115)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:73)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1859)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1745)
at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1710)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1550)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at java.util.HashSet.readObject(HashSet.java:341)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1158)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2278)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2202)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1567)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:427)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:437)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:424)
at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:412)
at org.apache.flink.util.SerializedValue.deserializeValue(SerializedValue.java:58)
at org.apache.flink.runtime.jobmanager.JobManager.org$apache$flink$runtime$jobmanager$JobManager$$submitJob(JobManager.scala:1253)
... 19 more
This particular error shows up because when you have copied the jar files to the lib/ folder, but didn't stop and start the flink cluster against after doing the copy (i.e. you had the cluster running before you performed the copy).
If you stop and start the cluster using ./bin/stop-cluster.sh and ./bin/start-cluster.sh after copying the jar files to the lib folder as mentioned on this page the problem gets fixed.
Also that's probably why you didn't get the error with new installation because you copied the jar files to the lib folder before starting the flink cluster.
I'm coding three Batch Applications, with the Python API, however, my third application is dealing with some Exceptions, especially when I increase the parallelism. This application has a Cross transformation inside it.
The cluster has 4 VMs, with 4 cpu cores and 7GB RAM each machine. So, the max parallelism setted 16.
The exception is:
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Job execution failed.
atorg.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:101)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:400)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:387)
at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
at org.apache.flink.api.java.ExecutionEnvironment.execute(ExecutionEnvironment.java:926)
at org.apache.flink.python.api.PythonPlanBinder.runPlan(PythonPlanBinder.java:149)
at org.apache.flink.python.api.PythonPlanBinder.main(PythonPlanBinder.java:114)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:528)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:419)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:339)
at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:831)
at org.apache.flink.client.CliFrontend.run(CliFrontend.java:256)
at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1073)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1120)
at org.apache.flink.client.CliFrontend$2.call(CliFrontend.java:1117)
at org.apache.flink.runtime.security.HadoopSecurityContext$1.run(HadoopSecurityContext.java:43)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:40)
at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1116)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply$mcV$sp(JobManager.scala:900)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)
at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$7.apply(JobManager.scala:843)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at org.apache.flink.python.api.streaming.data.PythonStreamer.streamBufferWithoutGroups(PythonStreamer.java:252)
at org.apache.flink.python.api.functions.PythonMapPartition.mapPartition(PythonMapPartition.java:54)
at org.apache.flink.runtime.operators.MapPartitionDriver.run(MapPartitionDriver.java:103)
at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:490)
at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:355)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
at java.lang.Thread.run(Thread.java:745)
I've got some evidences:
1) It's not at all deterministic, eventually, the application finished without errors.
2) With parallelism <= 4 the algorithm finish flawlessly, exception often happens with higher parallelism degrees.
3) Even entering small inputs (~20MB) the exception occurr.
4) I didn't find any error at the slaves logs (jobmanager), only at the Master Flink Taskmanager
Can you help to find out the responsible for this? I can get more logs (namenode, datanode, jobmanager, taskmanager) if necessary.
Their is a similar posting that claims to have the answer but I'm still getting the error after putting step -> stepwise=false in my camel route. -> exception
14:35:33,649 WARN [org.apache.camel.component.file.remote.RemoteFilePollingConsumerPollStrategy] (Camel (camel-1) thread #13 - sftp://myftp:22) Trying to recover by discon
necting from remote server forcing a re-connect at next poll: sftp://myUser#myftp:22
14:35:33,654 WARN [org.apache.camel.component.file.remote.SftpConsumer] (Camel (camel-1) thread #13 - sftp://myftp:22) Consumer Consumer[sftp://myftp:22?delay
=1h&delete=true&doneFileName=done&password=xxxxxx&sortBy=ignoreCase%3Afile%3Aname&stepwise=false&username=myUser] failed polling endpoint: Endpoint[sftp://myftp:22?delay=1h
&delete=true&doneFileName=done&password=xxxxxx&sortBy=ignoreCase%3Afile%3Aname&stepwise=false&username=myUser]. Will try again at next poll. Caused by: [org.apache.camel.component.file.
GenericFileOperationFailedException - Cannot list directory: .]: org.apache.camel.component.file.GenericFileOperationFailedException: Cannot list directory: .
at org.apache.camel.component.file.remote.SftpOperations.listFiles(SftpOperations.java:583) [camel-ftp-2.13.2.jar:2.13.2]
at org.apache.camel.component.file.remote.SftpConsumer.doPollDirectory(SftpConsumer.java:90) [camel-ftp-2.13.2.jar:2.13.2]
at org.apache.camel.component.file.remote.SftpConsumer.pollDirectory(SftpConsumer.java:52) [camel-ftp-2.13.2.jar:2.13.2]
at org.apache.camel.component.file.GenericFileConsumer.poll(GenericFileConsumer.java:119) [camel-core-2.13.2.jar:2.13.2]
at org.apache.camel.impl.ScheduledPollConsumer.doRun(ScheduledPollConsumer.java:187) [camel-core-2.13.2.jar:2.13.2]
at org.apache.camel.impl.ScheduledPollConsumer.run(ScheduledPollConsumer.java:114) [camel-core-2.13.2.jar:2.13.2]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [rt.jar:1.7.0_65]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304) [rt.jar:1.7.0_65]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) [rt.jar:1.7.0_65]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_65]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_65]
at java.lang.Thread.run(Thread.java:745) [rt.jar:1.7.0_65]
Caused by: 4:
at com.jcraft.jsch.ChannelSftp.ls(ChannelSftp.java:1660) [jsch-0.1.49.jar:]
at com.jcraft.jsch.ChannelSftp.ls(ChannelSftp.java:1466) [jsch-0.1.49.jar:]
at org.apache.camel.component.file.remote.SftpOperations.listFiles(SftpOperations.java:574) [camel-ftp-2.13.2.jar:2.13.2]
... 12 more
Caused by: java.io.IOException: Pipe closed
at java.io.PipedInputStream.read(PipedInputStream.java:308) [rt.jar:1.7.0_65]
at com.jcraft.jsch.Channel$MyPipedInputStream.updateReadSide(Channel.java:344) [jsch-0.1.49.jar:]
at com.jcraft.jsch.ChannelSftp.ls(ChannelSftp.java:1483) [jsch-0.1.49.jar:]
... 14 more
here is my route ->
sftp://myftp:22?delay
=1h&delete=true&doneFileName=done&password=xxxxxx&sortBy=ignoreCase%3Afile%3Aname&stepwise=false&username=myUser
Yeah, there is java.io.IOException: Pipe closed.
I just checked the code of camel-ftp, it has the code to check the connection, but it's hard to know if the connection is still opened if we don't send some bytes to server socket.
The solution could be force the ftp client send some ping message to check if the connection is still open.
If you want to set a very long delay for your ftp pulling, the work around way is setting disconnect option to be true to force ftp endpoint restart a new connection when it pull the directory again from the FTP server.
I just created a JIRA for it.