Error when trying to start Flink job from retained checkpoint - apache-flink

As I understand from the documentation, it should be possible to resume a Flink job from a checkpoint just as from a savepoint, by specifying the checkpoint path in the "Savepoint path" input box of the web UI (e.g. /path/to/my/checkpoint/chk-1, where "chk-1" contains the "_metadata" file).
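For context, the CLI equivalent of that input box would be passing the checkpoint directory as the savepoint path, roughly like this (the job jar path here is just a placeholder):
bin/flink run -s /path/to/my/checkpoint/chk-1 path/to/my-job.jar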
I've been trying this out, but I get the following exception:
2020-09-04 10:35:11
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:1006)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:454)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:449)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:461)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:707)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:532)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for LegacyKeyedProcessOperator_632e4c67d1f4899514828b9c5059a9bb_(1/1) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131)
... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Caught unexpected exception.
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:336)
at org.apache.flink.contrib.streaming.state.RocksDBStateBackend.createKeyedStateBackend(RocksDBStateBackend.java:548)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 11 more
Caused by: java.nio.file.NoSuchFileException: /tmp/flink-io-ee95b361-a616-4531-b402-7a21189e8ce5/job_c71cd62de3a34d90924748924e78b3f8_op_LegacyKeyedProcessOperator_632e4c67d1f4899514828b9c5059a9bb__1_1__uuid_ae7dd096-f52f-4eab-a2a3-acbfe2bc4573/336ed2fe-30a4-44b5-a419-9e485cd456a4/CURRENT
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:526)
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253)
at java.nio.file.Files.copy(Files.java:1274)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreInstanceDirectoryFromPath(RocksDBIncrementalRestoreOperation.java:483)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromLocalState(RocksDBIncrementalRestoreOperation.java:218)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreFromRemoteState(RocksDBIncrementalRestoreOperation.java:194)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restoreWithoutRescaling(RocksDBIncrementalRestoreOperation.java:168)
at org.apache.flink.contrib.streaming.state.restore.RocksDBIncrementalRestoreOperation.restore(RocksDBIncrementalRestoreOperation.java:154)
at org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackendBuilder.build(RocksDBKeyedStateBackendBuilder.java:279)
... 15 more
Does anyone have an idea of what's causing this?
UPDATE: After some tests, I noticed that this behavior depends on the state backend used. In this case I'm using RocksDBStateBackend with incremental checkpointing enabled. When I switched to FsStateBackend, the error disappeared.
Come to think of it, that would make sense: from what I understand, checkpoints taken with incremental checkpointing enabled only record the changes relative to the previous completed checkpoint instead of the full job state, so it would not be possible to restore the job from this kind of checkpoint on its own.
If that's correct, I think it would be useful to add a notice to the documentation (https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#resuming-from-a-retained-checkpoint).
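For reference, the setup I'm describing looks roughly like the sketch below (class name, checkpoint URI and interval are placeholders, not the actual job): RocksDB state backend with incremental checkpointing enabled and checkpoints retained on cancellation.

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetupSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The second constructor argument enables incremental checkpoints.
        env.setStateBackend(new RocksDBStateBackend("file:///path/to/my/checkpoint", true));

        // Take a checkpoint every minute and keep it when the job is cancelled,
        // so that a chk-N directory stays available for a later resume.
        env.enableCheckpointing(60_000);
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // ... sources, keyed process function, sinks ...
        env.execute("my-job");
    }
}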

Related

Getting Frequent org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException in Flink Session mode

We were facing a lot of TaskManager timeout exceptions during job restart. To fix that issue we set the TM timeout to 1 hour. We then started noticing that in Flink session mode, whenever we submit a job with old existing TMs, it starts failing with the NoResourceAvailableException below. This is specific to Flink session mode; it works fine in Flink per-job mode. Any input on how we can fix this?
resourcemanager.taskmanager-timeout: 3600000
akka.ask.timeout: 300s
web.timeout: 300000
java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
at org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResource$8(DefaultScheduler.java:515)
at java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
at java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
at org.apache.flink.runtime.scheduler.SharedSlot.cancelLogicalSlotRequest(SharedSlot.java:222)
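One thing that might be worth checking (an assumption on my side, not something verified against your setup): the exception above complains about the slot request timeout, which is a separate knob from the TaskManager idle timeout listed earlier, e.g. in flink-conf.yaml:
slot.request.timeout: 3600000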

Replay failed messages whenever I want, at any time

#camel Hi devs, I am currently working on a Camel route to transform messages from source to target systems, and I am stuck on an issue: I want to redeliver my messages when an exception occurs or an endpoint fails. I checked the Camel docs and found the Redelivery Policies, which work according to the configured delay. But my problem is that I want to replay messages whenever I choose. For example, last year some messages failed and those payloads are stored in my system; I want to replay those messages this year, like a replay. Can any devs help me with this? Thanks.
You can simply use the DeadLetterChannel EIP (https://camel.apache.org/components/3.16.x/eips/dead-letter-channel.html).
This will put your failed messages in a special channel (typically a persistent JMS queue). To replay a message, you only have to move it from the DLC ("myqueue.dead") to the original queue ("myqueue").
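For example, a minimal sketch of such an error handler in the Java DSL (the JMS endpoint URIs, redelivery settings and the processing bean are placeholders, not taken from your setup):

import org.apache.camel.builder.RouteBuilder;

public class DlcRouteSketch extends RouteBuilder {
    @Override
    public void configure() {
        // After the configured redeliveries are exhausted, the failed message
        // is parked on myqueue.dead; replaying it later is simply moving it
        // back onto myqueue.
        errorHandler(deadLetterChannel("jms:queue:myqueue.dead")
                .maximumRedeliveries(3)
                .redeliveryDelay(5000));

        from("jms:queue:myqueue")
                .to("bean:transformToTarget");
    }
}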

StreamingFileSink fails to start if an MPU is missing in S3

We are using StreamingFileSink in Flink 1.11 (AWS KDA) to write data from Kafka to S3.
Sometimes, even after a clean stop of the application, it fails to start with:
com.amazonaws.services.s3.model.AmazonS3Exception: The specified upload does not exist. The upload ID may be invalid, or the upload may have been aborted or completed.
Looking at the code, I can see that files are moved from in-progress to pending during a checkpoint: files are synced to S3 as MPU (multipart) uploads, or as _tmp_ objects when the upload part is too small.
However, pending files are committed during notifyCheckpointComplete, after the checkpoint is done.
StreamingFileSink will fail with the error above when an MPU that it has in state does not exist in S3.
Would the following scenario be possible:
Checkpoint is taken and files are transitioned into pending state.
notifyCheckpointComplete is called and it starts to complete the MPUs
Application is suddenly killed or even just stopped as part of a shutdown.
The checkpointed state still has information about the MPUs, but if you try to restore from it, they will not be found, because they were completed after the checkpoint and that completion is not part of the state.
Would it be better to ignore missing MPUs and _tmp_ files, or make it an option?
This way the above situation would not happen, and it would allow restoring from arbitrary checkpoints/savepoints.
The StreamingFileSink in Flink has been superseded by the new FileSink implementation since Flink 1.12, which uses the unified Sink API for both batch and streaming use cases. I don't think the problem you've described here would occur when using that implementation.
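For reference, a minimal sketch of what the FileSink setup could look like (bucket path, encoder and the surrounding job are placeholders; bucket assigners and rolling policies are left at their defaults):

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;

// Row-encoded FileSink writing one string record per line to S3. Part files
// are finalized through the unified sink's committer rather than through
// StreamingFileSink's notifyCheckpointComplete path.
public class FileSinkSketch {
    public static void attach(DataStream<String> fromKafka) {
        FileSink<String> sink = FileSink
                .forRowFormat(new Path("s3://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        fromKafka.sinkTo(sink);
    }
}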

SolrCloud: Underlying file changed by external force?

Having trouble with an issue I am unable to reproduce in SolrCloud. It seems to happen at random.
Underlying file changed by an external force at 2018-09-18T14:55:22Z, (lock=NativeFSLock(path=/path/to/my/shard/index/write.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive valid],creationTime=2018-09-18T14:55:22.006973Z)) Caused by: org.apache.lucene.store.AlreadyClosedException: Underlying file changed by an external force at 2018-09-18T14:55:21Z,
The index is schemaless and typically receives many simultaneous updates.
It usually starts with something like this (in this order, before the write.lock error occurs):
Error from server at server3/solr/myshard: Bad Request
Remote error message: Exception writing document id bc04df6e-f29f-4091-ad73-f708a97d28b4 to the index; possible analysis error.
3 Async exceptions during distributed update:
Remote error message: this IndexWriter is closed
Is there any way to self-recover from the IndexWriter being closed? After the error occurs, no more documents can be written to the collection. The only solution I have right now is to either delete the write.lock file and restart Solr, or recreate the collection completely.

Reconnect database in Django

I'm testing an import script on a shared web host I just got, but I found that transactions are blocked after it runs for 20 minutes or so. I assume this is to avoid overloading the database, but even when I import one item every second, I still run into the problem. To be specific, when I try to save an object I receive the error:
DatabaseError: current transaction is aborted, commands ignored until end of transaction block
I've tried waiting a few hours after this happens, but the block remains. The only way to resume importing is to completely restart the importing program. Because of this, I reasoned that all I need to do is reconnect to the DB. This might not be true, but it's worth a try.
So my question is: how can I disconnect and reconnect the DB connection in Django? Is this even possible?
Most likely some other database error occurred before this one, but your code ignored it and went forward with the transaction in a broken state.
