How can I see if a job failed and why? - apache-flink

How can I use ClusterClient to check if a job failed and why?
ClusterClient#getJobStatus may seem like a good first candidate but it only says if the job failed without any information regarding the exceptions.
The submission of the job is being done with a detached client therefore waiting for its ClusterClient#run to return a JobExecutionResult is not an option.
I've also tried
RestClusterClient#retrieveJob also does not work, failing with:
org.apache.flink.runtime.client.JobRetrievalException: Couldn't
retrieve leading JobManager. at
org.apache.flink.runtime.client.JobListeningContext.getJobManager(JobListeningContext.java:157)
at
org.apache.flink.runtime.client.JobListeningContext.getClassLoader(JobListeningContext.java:141)
at
org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:262)
at
org.apache.flink.client.program.ClusterClient.retrieveJob(ClusterClient.java:586)
at java.lang.Thread.run(Thread.java:745)
Caused by:
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException:
Could not retrieve the leader gateway. at
org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:82)
at
org.apache.flink.runtime.client.JobListeningContext.getJobManager(JobListeningContext.java:152)
... 10 more
Caused by: java.util.concurrent.TimeoutException: Futures
timed out after [10000 milliseconds] at
scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190) at
scala.concurrent.Await.result(package.scala) at
org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:80)
... 11 more

Use NewClusterClient#requestJobResult which can be done using a RestClusterClient.

Related

Flink - Failed to recover from a checkpoint

I'm running my cluster on kubernetes with a single jobmanager and 2 taskmanagers.
I tested the mechanism of checkpoint by killing one of the taskmanager pods while the job is running.
I got the following exceptions on the jobmanager and the restarted taskmanager:
Jobmanager exception:
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:195)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:253)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_54288f79b169ee3e8cb1feb33bbad4c3_(1/8) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
... 6 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:326)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 8 more
Caused by: java.nio.file.NoSuchFileException: /rocksdb/job_0a1a61f5cbecc09fbaef1257b3392b3a_op_WindowOperator_54288f79b169ee3e8cb1feb33bbad4c3__1_8__uuid_8b95eb2f-f6cf-4c35-8274-a9055376163d/db/000021.sst -> /rocksdb/job_0a1a61f5cbecc09fbaef1257b3392b3a_op_WindowOperator_54288f79b169ee3e8cb1feb33bbad4c3__1_8__uuid_8b95eb2f-f6cf-4c35-8274-a9055376163d/f1a97117-3810-400e-85ca-6e8c998a5ed4/000021.sst
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixFileSystemProvider.createLink(UnixFileSystemProvider.java:476)
at java.nio.file.Files.createLink(Files.java:1086)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:473)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:212)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
... 12 more
Taskmanager exception:
2020-01-13 09:26:01,943 ERROR org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder - Caught unexpected exception.
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Failed to sanitize XML document destined for handler class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:219)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:317)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:70)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:59)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1554)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1272)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4266)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:834)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listPrefix(PrestoS3FileSystem.java:484)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.access$000(PrestoS3FileSystem.java:112)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem$1.<init>(PrestoS3FileSystem.java:271)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listLocatedStatus(PrestoS3FileSystem.java:269)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listStatus(PrestoS3FileSystem.java:258)
at org.apache.flink.fs.s3.common.hadoop.HadoopFileSystem.listStatus(HadoopFileSystem.java:157)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.listStatus(SafetyNetWrapperFileSystem.java:97)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:460)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:212)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:253)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.fs.s3base.shaded.com.amazonaws.AbortedException:
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:191)
... 44 more
2020-01-13 09:26:01,944 WARN org.apache.flink.streaming.api.operators.BackendRestorerProcedure - Exception while restoring keyed state backend for WindowOperator_54288f79b169ee3e8cb1feb33bbad4c3_(7/8) from alternative (1/1), will retry while more alternatives are available.
org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:326)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:520)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:291)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:307)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:135)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:253)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeState(StreamTask.java:881)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:395)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Failed to sanitize XML document destined for handler class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:219)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.parseListBucketObjectsResponse(XmlResponsesSaxParser.java:317)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:70)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.Unmarshallers$ListObjectsUnmarshaller.unmarshall(Unmarshallers.java:59)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:62)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.internal.S3XmlResponseHandler.handle(S3XmlResponseHandler.java:31)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.response.AwsResponseHandlerAdapter.handle(AwsResponseHandlerAdapter.java:70)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleResponse(AmazonHttpClient.java:1554)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1272)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4266)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client.listObjects(AmazonS3Client.java:834)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listPrefix(PrestoS3FileSystem.java:484)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.access$000(PrestoS3FileSystem.java:112)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem$1.<init>(PrestoS3FileSystem.java:271)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listLocatedStatus(PrestoS3FileSystem.java:269)
at org.apache.flink.fs.s3presto.shaded.com.facebook.presto.hive.s3.PrestoS3FileSystem.listStatus(PrestoS3FileSystem.java:258)
at org.apache.flink.fs.s3.common.hadoop.HadoopFileSystem.listStatus(HadoopFileSystem.java:157)
at org.apache.flink.core.fs.SafetyNetWrapperFileSystem.listStatus(SafetyNetWrapperFileSystem.java:97)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:460)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:212)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:188)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:162)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:148)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:270)
... 12 more
Caused by: org.apache.flink.fs.s3base.shaded.com.amazonaws.AbortedException:
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.abortIfNeeded(SdkFilterInputStream.java:53)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:81)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
at java.io.InputStreamReader.read(InputStreamReader.java:184)
at java.io.BufferedReader.read1(BufferedReader.java:210)
at java.io.BufferedReader.read(BufferedReader.java:286)
at java.io.Reader.read(Reader.java:140)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser.sanitizeXmlDocument(XmlResponsesSaxParser.java:191)
... 44 more
When I tried to restore from a savepoint, everything works as expected.
Any idea?
From our experience, one possible cause could be you come across below exception first:
Caused by: org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Failed to sanitize XML document destined for handler class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.model.transform.XmlResponsesSaxParser$ListBucketHandler
Then the procedure to restore rocksdb state-backend would be interrupted leading to file /rocksdb/job_0a1a61f5cbecc09fbaef1257b3392b3a_op_WindowOperator_54288f79b169ee3e8cb1feb33bbad4c3__1_8__uuid_8b95eb2f-f6cf-4c35-8274-a9055376163d/f1a97117-3810-400e-85ca-6e8c998a5ed4/000021.sst deleted in https://github.com/apache/flink/blob/390926e61aeb69837c70a024ad6e7ff02eccdf2d/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/restore/RocksDBIncrementalRestoreOperation.java#L197
That's why you found the NoSuchFileException.

Flink job submission failed for akka.pattern.AskTimeoutException

Job submission failed for akka.pattern.AskTimeoutException, I have tried to set akka.ask.timeout: "60s", still failed for the same error. Any idea?
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-1594246162]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
This seems to be a known issue FLINK-11143. Hope it could be resolved soon.

Flow has failed with error Shutting down because of violation of the Reactive Streams specification

It seems I can never get the error handling right when using Akka Streams.
So this is my code
var db = Database.forConfig("oracle")
var mysqlDb = Database.forConfig("mysql_read")
var mysqlDbWrite = Database.forConfig("mysql_write")
implicit val actorSystem = ActorSystem()
val decider : Supervision.Decider = {
case _: Exception =>
println("got an exception restarting connections")
// let us restart our connections
db.close()
mysqlDb.close()
mysqlDbWrite.close()
db = Database.forConfig("oracle")
mysqlDb = Database.forConfig("mysql_read")
mysqlDbWrite = Database.forConfig("mysql_write")
Supervision.Restart
}
implicit val materializer = ActorMaterializer(ActorMaterializerSettings(actorSystem).withSupervisionStrategy(decider))
and I have a flow like this
val alreadyExistsFilter : Flow[Foo, Foo, NotUsed] = Flow[Foo].mapAsync(10){ foo =>
try {
val existsQuery = sql"""SELECT id FROM foo WHERE id = ${foo.id}""".as[Long]
mysqlDbWrite.run(existsQuery).map(v => (foo, v))
} catch {
case e: Throwable =>
println(s"Lookup failed for ${foo}")
throw e // will restart the stream
}
}.collect {case (f, v) if v.isEmpty => f}
So basically if the foo already exists in MySQL then the record should not be processed any further by the stream.
My hope with this code was that if anything fails with the mysql lookup (the mysql machine is pretty bad and timeouts are common), the record will be printed and discarded and the stream will continue with the remaining records courtesy of the supervision.
When I run this code. I see errors like
[error] (mysql_write network timeout executor) java.lang.RuntimeException: java.sql.SQLException: Invalid socket timeout value or state
java.lang.RuntimeException: java.sql.SQLException: Invalid socket timeout value or state
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5576)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.sql.SQLException: Invalid socket timeout value or state
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:998)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:937)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:926)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:872)
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4852)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Socket is closed
at java.net.Socket.setSoTimeout(Socket.java:1137)
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4850)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
and
[error] (mysql_write network timeout executor) java.lang.NullPointerException
java.lang.NullPointerException
at com.mysql.jdbc.MysqlIO.setSocketTimeout(MysqlIO.java:4850)
at com.mysql.jdbc.ConnectionImpl$12.run(ConnectionImpl.java:5574)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
One thing which surprises me here is that these exceptions don't come from my catch block. because I don't see the println statement of my catch block. The stack trace doesn't show me where it originated from... but since it is saying mysql_write I can assume that its the Flow above because only this Flow uses mysql_write.
Finally the entire stream crashes with the error
[trace] Stack trace suppressed: run last compile:runMain for the full output.
flow has failed with error Shutting down because of violation of the Reactive Streams specification.
14:51:06,973 |-INFO in ch.qos.logback.classic.AsyncAppender[asyncKafkaAppender] - Worker thread will flush remaining events before exiting.
[success] Total time: 3480 s, completed Sep 26, 2017 2:51:07 PM
14:51:07,603 |-INFO in ch.qos.logback.core.hook.DelayingShutdownHook#2320545b - Sleeping for 1 seconds
I don't know what I did to violate the reactive streams specification!!
A first stab at getting a more predictable solution would be removing the blocking behaviour (Await.result) and use mapAsync. A rewrite of the alreadyExistsFilter flow could be:
val alreadyExistsFilter : Flow[Foo, Foo, NotUsed] = Flow[Foo].mapAsync(3) { foo ⇒
val existsQuery = sql"""SELECT id FROM foo WHERE id = ${foo.id}""".as[Long]
foo → Await.result(mysqlDbWrite.run(existsQuery), Duration.Inf)
}.collect{
case (foo, res) if res.isDefined ⇒ foo
}
More info on blocking in Akka can be found in the docs.
The answer given by Stefano is correct. The error was indeed coming because of blocking code in the flow.
Although, my initial program was running against scala 2.11 and even after switching to the mapAsync, the problem persisted.
Since this is a command line tool it was easy for me to switch to scala 2.12 and try again.
When I tried with Scala 2.12 it worked perfectly.
One thing which greatly helped me is to have "ch.qos.logback" % "logback-classic" % "1.2.3", in the dependencies. This will show you each and every SQL statement which is being executed and easily see if something is going wrong.

AppEngine RemoteAPI SocketTimeoutException

I'm using RemoteAPI to fetch entities from GAE Datastore, 300 at a time.
I'm doing something along the lines of:
while(!(emails = getEmails()).isEmpty()) {
Filter filter = new FilterPredicate("email", FilterOperator.IN, emails)
Query query = new Query("MyEntity").setFilter(filter);
QueryResultIterable<Entity> result = ds.prepare(query).asQueryResultIterable();
for (Entity entity : result) {
System.out.println(entity.getProperty("name"));
}
}
I'm processing something like 50k emails. The first time I ran this code it got to maybe 3/4 of the way, then it threw the following exception. Now it throws it after a single loop iteration is run.
com.google.appengine.tools.remoteapi.RemoteApiException: remote API call: I/O error
at com.google.appengine.tools.remoteapi.RemoteRpc.makeException(RemoteRpc.java:160)
at com.google.appengine.tools.remoteapi.RemoteRpc.callImpl(RemoteRpc.java:104)
at com.google.appengine.tools.remoteapi.RemoteRpc.call(RemoteRpc.java:50)
at com.google.appengine.tools.remoteapi.RemoteDatastore.runQuery(RemoteDatastore.java:156)
at com.google.appengine.tools.remoteapi.RemoteDatastore.handleRunQuery(RemoteDatastore.java:115)
at com.google.appengine.tools.remoteapi.RemoteDatastore.handleDatastoreCall(RemoteDatastore.java:93)
at com.google.appengine.tools.remoteapi.RemoteApiDelegate.makeDefaultSyncCall(RemoteApiDelegate.java:57)
at com.google.appengine.tools.remoteapi.StandaloneRemoteApiDelegate.makeSyncCall(StandaloneRemoteApiDelegate.java:47)
at com.google.appengine.tools.remoteapi.StandaloneRemoteApiDelegate$1.call(StandaloneRemoteApiDelegate.java:58)
at com.google.appengine.tools.remoteapi.StandaloneRemoteApiDelegate$1.call(StandaloneRemoteApiDelegate.java:54)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.read(InputRecord.java:480)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:690)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1324)
at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:338)
at com.google.appengine.repackaged.com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:37)
at com.google.appengine.repackaged.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:94)
at com.google.appengine.repackaged.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:972)
at com.google.appengine.tools.remoteapi.OAuthClient.post(OAuthClient.java:54)
at com.google.appengine.tools.remoteapi.RemoteRpc.callImpl(RemoteRpc.java:102)
... 12 more
I can't figure out what the problem is, but the code seems to be evaluating the for() condition before throwing the exception.
Could this be a quota problem? The quota details screen doesn't show any problems and I can't find any relevant information in the documentation.
For future readers of this question, if you see occurrences of RemoteApiException: remote API call: I/O error which are happening consistently and not intermittently, this could be related to a disruption in network connectivity or possibly a remote issue on the App Engine side.
If the first possibility is ruled out, the best course of action is to report the issue on the Google App Engine issue tracker.
To fix this, first, check your Internet connection. Then clean all artifacts and build them again by (with IntelliJ):
Go to Build => Build Artifacts...
Focus on All Artifacts => Clean
Focus on All Artifacts => Build

Quartz clustering in camel spring DSL

I am trying to achieve "requests recovery" in fail-over scenario in two different machine with their clock also sync.
My configuration as below:
step 1: camel-context.xml
I have defined the below route in camel-context.xml file.
<route id="quartz" trace="true">
<from uri="quartz2://cluster/quartz?cron=0+0/2+++*+?&durableJob=true&stateful=true&recoverableJob=true">
<route>
step 2: quartz.properties:
I have enabled
org.quartz.jobStore.isClustered = true
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.instanceName =ClusteredScheduler
Currently I am running same camel application in two different instances in my local and clustering is working fine . But when I try to test the "requests recovery" I am getting below exception.
Exception :
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: detected 1 failed or restarted instances.
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: Scanning for instance "6308270818"'s failed in-progress jobs.
[QuartzScheduler_ClusteredScheduler-camelContext-16308243724_ClusterManager] INFO org.quartz.impl.jdbcjobstore.JobStoreTX - ClusterManager: ......Scheduled 1 recoverable job(s) for recovery.
[ClusteredScheduler-camelContext_Worker-1] WARN org.apache.camel.component.quartz2.CamelJob - Cannot find existing QuartzEndpoint with uri: quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true. Creating new endpoint instance.
[ClusteredScheduler-camelContext_Worker-1] ERROR org.apache.camel.component.quartz2.CamelJob - Failed to execute CamelJob.
**org.apache.camel.ResolveEndpointFailedException: Failed to resolve endpoint: quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true due to: Trigger key cluster.quartz is already in used by Endpoint[quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true]**
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:545)
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:558)
at org.apache.camel.component.quartz2.CamelJob.lookupQuartzEndpoint(CamelJob.java:123)
at org.apache.camel.component.quartz2.CamelJob.execute(CamelJob.java:49)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
Caused by: java.lang.IllegalArgumentException: Trigger key cluster.quartz is already in used by Endpoint[quartz2://cluster/quartz?cron=0+0%2F2+*+*+*+%3F&durableJob=true&recoverableJob=true&stateful=true]
at org.apache.camel.component.quartz2.QuartzEndpoint.ensureNoDupTriggerKey(QuartzEndpoint.java:272)
at org.apache.camel.component.quartz2.QuartzEndpoint.addJobInScheduler(QuartzEndpoint.java:254)
at org.apache.camel.component.quartz2.QuartzEndpoint.doStart(QuartzEndpoint.java:202)
at org.apache.camel.support.ServiceSupport.start(ServiceSupport.java:61)
at org.apache.camel.impl.DefaultCamelContext.startService(DefaultCamelContext.java:2158)
at org.apache.camel.impl.DefaultCamelContext.doAddService(DefaultCamelContext.java:1016)
at org.apache.camel.impl.DefaultCamelContext.addService(DefaultCamelContext.java:977)
at org.apache.camel.impl.DefaultCamelContext.addService(DefaultCamelContext.java:973)
at org.apache.camel.impl.DefaultCamelContext.getEndpoint(DefaultCamelContext.java:541)
... 5 more
After shutting down the instance1 which is currently excuting the job , instance 2 is trying to recover the job immediately but its failing to execute the job .It is picking the same job in next interval (which is fine).
My requirement is active node immediately recover the failed job.
Thanks in advance.
I think we can avoid the checking of ensureNoDupTriggerKey, if the recoverableJob is true. I just created a JIRA CAMEL-8076 for it.

Resources