Nodetool rebuild: stream failing after some time
I have added a new node to my existing single-node Cassandra cluster, which holds around 48 GB of data.
Only one keyspace accounts for that data, and it has a replication factor of 2 (I changed it after adding the new node). I am trying to run nodetool rebuild on the new node so the data can be streamed to it from the seed node.
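For context, the steps were roughly the following (a sketch only; keyspace_name is a placeholder and the replication class shown is illustrative):
# on the existing (seed) node: bump the keyspace to RF 2
cqlsh -e "ALTER KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"
# on the new node
nodetool rebuild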
The first stream ended after transferring 36 GB of data and the node went down. So I repeated the process, but the stream keeps failing after transferring some amount of data (12-25 GB).
It ends with the following error.
error: Error while rebuilding node: Stream failed
-- StackTrace --
java.lang.RuntimeException: Error while rebuilding node: Stream failed
at org.apache.cassandra.service.StorageService.rebuild(StorageService.java:1319)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:275)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
at com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
at com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:819)
at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
at javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1468)
at javax.management.remote.rmi.RMIConnectionImpl.access$300(RMIConnectionImpl.java:76)
at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1309)
at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1401)
at javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:829)
at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
at sun.rmi.transport.Transport$1.run(Transport.java:200)
at sun.rmi.transport.Transport$1.run(Transport.java:197)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
at sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
at java.security.AccessController.doPrivileged(Native Method)
at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
P.S. I have made sure that the streaming_socket_timeout_in_ms is set to at least 24 hours.
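In cassandra.yaml that corresponds to (24 hours expressed in milliseconds):
streaming_socket_timeout_in_ms: 86400000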
Kindly help me out here, guys.
Thanks.
Update:
I ran nodetool rebuild keyspace_name instead of nodetool rebuild and it ended with this error again.
WARN [StreamReceiveTask:9] 2019-10-23 11:14:41,522 StreamResultFuture.java:214 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Stream failed
ERROR [RMI TCP Connection(12)-10.128.1.3] 2019-10-23 11:14:42,316 StorageService.java:1318 - Error while rebuilding node
org.apache.cassandra.streaming.StreamException: Stream failed
at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:215) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:191) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:481) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.onError(StreamSession.java:571) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:281) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_222]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[na:1.8.0_222]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_222]
INFO [Service Thread] 2019-10-23 11:14:43,223 GCInspector.java:284 - ConcurrentMarkSweep GC in 310ms. CMS Old Gen: 2391324840 -> 639245216; Code Cache: 38320192 -> 38627904; Compressed Class Space: 554$
ERROR [STREAM-IN-/10.128.1.1:7000] 2019-10-23 11:14:48,769 StreamSession.java:593 - [Stream #b9b051b0-f580-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.1
java.lang.RuntimeException: Outgoing stream handler has been closed
at org.apache.cassandra.streaming.ConnectionHandler.sendMessage(ConnectionHandler.java:143) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.receive(StreamSession.java:655) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.StreamSession.messageReceived(StreamSession.java:523) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:317) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
Update 2:
I tried to run nodetool rebuild again on a fresh node.
The stream fails again after transferring around 95% of the data.
This is the log of the streaming (source) node:
INFO [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,497 StreamResultFuture.java:116 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56486] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56488] 2019-10-23 11:16:03,498 StreamResultFuture.java:123 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:16:03,600 StreamResultFuture.java:173 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO [Service Thread] 2019-10-23 11:19:14,472 GCInspector.java:284 - ParNew GC in 517ms. CMS Old Gen: 104131728 -> 121315352; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 67963984 -> 61263088
ERROR [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:56:43,902 StreamSession.java:706 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Remote peer 10.128.1.3 failed stream session.
INFO [IndexSummaryManager:1] 2019-10-23 11:58:32,284 IndexSummaryRedistribution.java:77 - Redistributing index summaries
INFO [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,687 StreamResultFuture.java:187 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56486] 2019-10-23 11:59:38,688 StreamSession.java:593 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.lang.RuntimeException: Transfer of file /var/lib/cassandra/data/thingsboard/ts_kv_cf-53b7bf3096ec11e99154356269723c5c/md-583-big-Data.db already completed or aborted (perhaps session failed?).
at org.apache.cassandra.streaming.messages.OutgoingFileMessage.startTransfer(OutgoingFileMessage.java:119) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:49) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
WARN [STREAM-IN-/10.128.1.3:56488] 2019-10-23 11:59:38,688 StreamResultFuture.java:214 - [Stream #80136bd0-f586-11e9-92dd-9765711f899a] Stream failed
INFO [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,860 StreamResultFuture.java:116 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Creating new streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56674] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-INIT-/10.128.1.3:56676] 2019-10-23 12:03:24,861 StreamResultFuture.java:123 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a, ID#0] Received streaming plan for Rebuild
INFO [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:03:24,950 StreamResultFuture.java:173 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a ID#0] Prepare completed. Receiving 0 files(0.000KiB), sending 133 f$
INFO [Service Thread] 2019-10-23 12:04:18,160 GCInspector.java:284 - ParNew GC in 307ms. CMS Old Gen: 124972984 -> 125070416; Par Eden Space: 1342177280 -> 0; Par Survivor Space: 61042328 -> 82423296
INFO [GossipStage:1] 2019-10-23 12:27:39,200 Gossiper.java:1026 - InetAddress /10.128.1.3 is now DOWN
INFO [HANDSHAKE-/10.128.1.3] 2019-10-23 12:27:39,424 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3
ERROR [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,107 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
java.net.SocketException: End-of-stream reached
at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:71) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:311) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
INFO [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:187 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Session with /10.128.1.3 is complete
ERROR [STREAM-OUT-/10.128.1.3:56674] 2019-10-23 12:27:45,108 StreamSession.java:593 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Streaming error occurred on session with peer 10.128.1.3
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:145) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter.lambda$write$0(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.io.util.BufferedDataOutputStreamPlus.applyToChannel(BufferedDataOutputStreamPlus.java:350) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.compress.CompressedStreamWriter.write(CompressedStreamWriter.java:85) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage.serialize(OutgoingFileMessage.java:101) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:52) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.OutgoingFileMessage$1.serialize(OutgoingFileMessage.java:41) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.messages.StreamMessage.serialize(StreamMessage.java:50) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.sendMessage(ConnectionHandler.java:408) ~[apache-cassandra-3.11.4.jar:3.11.4]
at org.apache.cassandra.streaming.ConnectionHandler$OutgoingMessageHandler.run(ConnectionHandler.java:380) ~[apache-cassandra-3.11.4.jar:3.11.4]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) ~[na:1.8.0_222]
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:605) ~[na:1.8.0_222]
at org.apache.cassandra.io.util.ChannelProxy.transferTo(ChannelProxy.java:141) ~[apache-cassandra-3.11.4.jar:3.11.4]
... 10 common frames omitted
WARN [STREAM-IN-/10.128.1.3:56676] 2019-10-23 12:27:45,108 StreamResultFuture.java:214 - [Stream #1da92910-f58d-11e9-92dd-9765711f899a] Stream failed
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:525 - Removing host: 0e8ad28d-6cc2-46df-8d3f-f346d464db40
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:19,854 Gossiper.java:526 - Sleeping for 30000ms to ensure /10.128.1.3 does not change
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:49,854 Gossiper.java:533 - Advertising removal for /10.128.1.3
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,245 StreamResultFuture.java:90 - [Stream #aae08f50-f590-11e9-9934-850cf6bcace3] Executing streaming plan for Restore replica count
INFO [MiscStage:1] 2019-10-23 12:28:50,247 StorageService.java:4459 - Received unexpected REPLICATION_FINISHED message from /10.128.1.1. Was this node recently a removal coordinator?
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,248 StorageService.java:2584 - Removing tokens [-9135980046459212380, -9100471967410923634, -9097242662756219549, -8974765285872613713, -895$
INFO [RMI TCP Connection(124)-10.128.1.1] 2019-10-23 12:28:50,317 Gossiper.java:557 - Completing removal of /10.128.1.3
INFO [HANDSHAKE-/10.128.1.3] 2019-10-23 12:31:35,019 OutboundTcpConnection.java:561 - Handshaking version with /10.128.1.3
I am totally clueless about why it's failing.
Can anyone point me in the right direction?
I have made sure that there is no firewall issue, and I am not using SSL for internode communication.
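(The firewall check was just a basic connectivity test on the internode storage port 7000, roughly along these lines:)
nc -zv 10.128.1.1 7000   # from the new node towards the seed
nc -zv 10.128.1.3 7000   # from the seed towards the new node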
org.apache.cassandra.io.FSReadError: java.io.IOException: Broken pipe
There could be a number of reasons for this (corrupted data, networking problems, schema issues between the two nodes), but basically it means the connection is being severed, which kills off the streaming in progress.
Networking issues are the most likely cause. If you have any networking metrics, use those to debug the connection.
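For example, while the rebuild is running you can keep an eye on the stream from both sides with something like this (a rough sketch; the log path varies by install):
watch -n 10 nodetool netstats           # shows active streaming sessions and bytes transferred
tail -f /var/log/cassandra/system.log   # watch for GC pauses, dropped connections and streaming errors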
There are a few things you can do here; the main one is to reduce the volume of data that has to be streamed. You can achieve this by:
reducing the keyspace RF back to 1
adding the node with auto_bootstrap: true in cassandra.yaml
increasing the RF back to 2
repairing the data
This yields the same end state, two nodes that each hold 100% of the data, but during the node standup you only stream half of that data; the repairs then, in smaller sessions (smaller units of work), restore whatever is still missing to get you back to 100%.
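As a rough command sketch (assuming SimpleStrategy and a keyspace called keyspace_name; adjust names and replication class to your schema), that sequence looks like:
# 1. drop the keyspace back to RF 1 before the new node joins
cqlsh -e "ALTER KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};"
# 2. start the new node with auto_bootstrap: true (the default) in cassandra.yaml and let it bootstrap
# 3. once it has joined, raise the RF again
cqlsh -e "ALTER KEYSPACE keyspace_name WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2};"
# 4. repair so the new replica picks up the data it is now responsible for
nodetool repair -full keyspace_name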
On a side note, my advice would be to start snapshotting your node regularly, as there appear to be signs of bad health. Running a single node of Cassandra means you're not really protected from data loss; that's why C* is distributed and why a replication factor of 3 is recommended for most setups.
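For the snapshots, a scheduled nodetool snapshot is a reasonable starting point (a sketch; rotating old snapshots is left to you):
nodetool snapshot -t backup_$(date +%F) keyspace_name
# later, reclaim disk space from an old snapshot
nodetool clearsnapshot -t backup_2019-10-01 keyspace_name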