Flink Task Manager Suddenly Crashed - apache-flink

A Flink TaskManager crashed after three months of running, with the stack trace below.
2021-12-05 07:22:05,369 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.notifyBufferAvailable(BufferManager.java:296)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.fireBufferAvailableNotification(LocalBufferPool.java:507)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:494)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:460)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager$AvailableBufferQueue.addExclusiveBuffer(BufferManager.java:399)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.recycle(BufferManager.java:200)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:95)
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:95)
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$615/1465249724.runDefaultAction(Unknown Source)
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1480/994476387.run(Unknown Source)
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
java.lang.Thread.run(Thread.java:748)
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 (5e34a8de7bcff882f37c073f250c2594).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Task GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 is already in state CANCELING
2021-12-05 07:22:05,372 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:RELEASING, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:15,362 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354) [flink-dist_2.12-1.13.1.jar:1.13.1]
Caused by: java.util.concurrent.TimeoutException
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_232]
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-9fad861a-b657-4625-a184-db126c423c2f
While debugging, I saw on our Datadog dashboard that input and output buffer usage had reached 100%.
I also found that the last 2 checkpoints failed with the message "Checkpoint expired before completing". The checkpoint timeout is 2 minutes.
How can I fix this issue?

Checkpoint timeouts are generally caused by either
1. backpressure causing the checkpoint barriers to progress too slowly across the execution graph, or
2. some sort of bottleneck preventing Flink from writing fast enough to the checkpoint storage (e.g., network starvation, insufficient IOPS quota).
It looks like you are using unaligned checkpointing. This should help with point number 1 above, but could be causing point number 2 to be a problem, since unaligned checkpoints increase the amount of data being checkpointed (by up to about 1 GB in your case, it looks like).
You might just want to increase the checkpoint timeout. Having checkpoints time out is almost never helpful.
But it also appears that you have significant backpressure. Figuring out what's causing that and doing something about it should help. (Your logs show Flink 1.13.1, so the improved backpressure monitoring introduced in 1.13 is already available and should make this easier.) Perhaps you have data skew, or perhaps you need to scale up the cluster.
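One rough way to check the data-skew hypothesis is to compare how many records each parallel subtask of the slow operator has received (for example from the per-subtask numRecordsIn metric). The sketch below is plain Java with made-up counts; the class name, the sample numbers, and the idea of using max/mean as the indicator are assumptions for illustration, not anything Flink provides.

```java
import java.util.Arrays;

/** Sketch: quantify data skew from per-subtask record counts (hypothetical inputs). */
class SkewCheck {

    /** Ratio of the busiest subtask's record count to the mean; 1.0 means perfectly even. */
    static double skewRatio(long[] recordsPerSubtask) {
        if (recordsPerSubtask.length == 0) {
            throw new IllegalArgumentException("need at least one subtask");
        }
        long max = Arrays.stream(recordsPerSubtask).max().getAsLong();
        double mean = Arrays.stream(recordsPerSubtask).average().getAsDouble();
        // If no subtask has seen any records there is nothing to be skewed.
        return mean == 0.0 ? 1.0 : max / mean;
    }

    public static void main(String[] args) {
        // Made-up counts for four subtasks; the last one is clearly hot.
        long[] counts = {1_000, 1_100, 900, 9_000};
        System.out.printf("skew ratio = %.2f%n", skewRatio(counts));
    }
}
```

A ratio close to 1 suggests the load is spread evenly and scaling up is the more likely remedy; a ratio of several times the mean points at a hot key. Where to draw the line is a judgment call.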

Related

Recovery from checkpoint fails with EOFException

This is the stacktrace:
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: java.io.EOFException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
... 3 common frames omitted
Caused by: java.lang.RuntimeException: java.io.EOFException
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 3 common frames omitted
Caused by: java.io.EOFException: null
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at java.io.DataInputStream.readUTF(DataInputStream.java:589)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeStreamStateHandleMap(MetadataV2V3SerializerBase.java:730)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeKeyedStateHandle(MetadataV2V3SerializerBase.java:408)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeSubtaskState(MetadataV2V3SerializerBase.java:269)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV3Serializer.deserializeOperatorState(MetadataV3Serializer.java:183)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeMetadata(MetadataV2V3SerializerBase.java:164)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV3Serializer.deserialize(MetadataV3Serializer.java:89)
at org.apache.flink.runtime.checkpoint.Checkpoints.loadCheckpointMetadata(Checkpoints.java:110)
at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:140)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1648)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:163)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:138)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:335)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:191)
at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:140)
at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:134)
at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:346)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:323)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
... 4 common frames omitted
Is the checkpoint corrupt?
When I debug, it looks like Flink expects a certain number of StreamStateHandle entries, but the stream ends before all of them have been read.
Does Flink have a mechanism to ignore a checkpoint when recovery from it fails? The process fails because of the checkpoint. Is there a way to recover from this failure?
Thanks
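The symptom described above (a reader that expects N entries but hits the end of the stream first) can be reproduced with the same JDK calls that appear at the bottom of the stack trace. The sketch below uses a made-up layout, an entry count followed by that many UTF strings; it is not Flink's actual metadata format, and the class and method names are hypothetical. It only shows why a truncated file surfaces as an EOFException from DataInputStream.readUTF.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.Arrays;

/** Sketch: a count-prefixed file that is cut short fails with EOFException on read. */
class TruncatedMetadataDemo {

    /** Writes an entry count followed by that many UTF strings (a made-up layout). */
    static byte[] write(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(names.length);
            for (String name : names) {
                out.writeUTF(name);
            }
        }
        return bytes.toByteArray();
    }

    /** Reads the layout back; returns the entry count, or -1 if the data ends early. */
    static int read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int expected = in.readInt();
            for (int i = 0; i < expected; i++) {
                in.readUTF(); // throws EOFException when the stream ends too soon
            }
            return expected;
        } catch (EOFException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = write("handle-a", "handle-b", "handle-c");
        byte[] truncated = Arrays.copyOf(full, full.length - 5);
        System.out.println("complete data:  " + read(full));
        System.out.println("truncated data: " + read(truncated));
    }
}
```

Comparing the size of the metadata file on the checkpoint storage with what the writer should have produced is one way to tell a partially written file from a format mismatch.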

Call to RocksDB.get times out when FlinkSQL job restarts

We are running a streaming FlinkSQL job. If the job restarts (say due to a checkpoint failure), task managers log this error (see full stack trace below):
did not react to cancelling signal - interrupting; it is stuck for 30 seconds in method:
app//org.rocksdb.RocksDB.get(Native Method)
We see from the logs that RocksDB was closed exactly 30 seconds earlier:
Closed RocksDB State Backend. Cleaning up RocksDB working directory /flink-tmp/rocksdb/job_90331a967b94c5abd7b5377a55cc67ac_op_SlicingWindowOperator_4808b9a6cd8a2889e00c15fe1a792329__17_50__uuid_eb1b282c-8952-4b9f-b6d6-ee7be011d59f.
Is closing RocksDB what prevents the get operation from returning?
We are using Flink 1.15.0, and running a query like this:
INSERT INTO BigtableTable
SELECT CONCAT_WS('#', user_id, bucket) AS rowkey, cell_timestamp, ROW(hllAttributeCount)
FROM (
  SELECT
    user_id,
    window_end AS cell_timestamp,
    DATE_FORMAT(window_end, 'yyyy-MM-dd:HH') AS bucket,
    STRING_HLL(attribute_to_count) AS hllAttributeCount
  FROM TABLE(TUMBLE(TABLE inputTable, DESCRIPTOR(event_time), INTERVAL '5' MINUTES))
  GROUP BY user_id, window_start, window_end)
Full stack trace:
Task 'GlobalWindowAggregate[5] -> Calc[6] -> Sink: table[7] (4/50)#0' did not react to cancelling signal - interrupting; it is stuck for 30 seconds in method:
app//org.rocksdb.RocksDB.get(Native Method)
app//org.rocksdb.RocksDB.get(RocksDB.java:2084)
app//org.apache.flink.contrib.streaming.state.RocksDBValueState.value(RocksDBValueState.java:83)
app//org.apache.flink.table.runtime.operators.window.state.WindowValueState.value(WindowValueState.java:44)
app//org.apache.flink.table.runtime.operators.aggregate.window.combines.GlobalAggCombiner.combineAccumulator(GlobalAggCombiner.java:94)
app//org.apache.flink.table.runtime.operators.aggregate.window.combines.GlobalAggCombiner.combine(GlobalAggCombiner.java:85)
app//org.apache.flink.table.runtime.operators.aggregate.window.buffers.RecordsWindowBuffer.flush(RecordsWindowBuffer.java:112)
app//org.apache.flink.table.runtime.operators.aggregate.window.processors.AbstractWindowAggProcessor.prepareCheckpoint(AbstractWindowAggProcessor.java:203)
app//org.apache.flink.table.runtime.operators.window.slicing.SlicingWindowOperator.prepareSnapshotPreBarrier(SlicingWindowOperator.java:267)
app//org.apache.flink.streaming.runtime.tasks.RegularOperatorChain.prepareSnapshotPreBarrier(RegularOperatorChain.java:89)
app//org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:300)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$12(StreamTask.java:1253)
app//org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1997/0x0000000840efa440.run(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:1241)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:1198)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:147)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.triggerCheckpoint(SingleCheckpointBarrierHandler.java:287)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.access$100(SingleCheckpointBarrierHandler.java:64)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$ControllerImpl.triggerGlobalCheckpoint(SingleCheckpointBarrierHandler.java:493)
app//org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.triggerGlobalCheckpoint(AbstractAlignedBarrierHandlerState.java:74)
app//org.apache.flink.streaming.runtime.io.checkpointing.AbstractAlignedBarrierHandlerState.barrierReceived(AbstractAlignedBarrierHandlerState.java:66)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.lambda$processBarrier$2(SingleCheckpointBarrierHandler.java:234)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler$$Lambda$2010/0x0000000840efd040.apply(Unknown Source)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.markCheckpointAlignedAndTransformState(SingleCheckpointBarrierHandler.java:262)
app//org.apache.flink.streaming.runtime.io.checkpointing.SingleCheckpointBarrierHandler.processBarrier(SingleCheckpointBarrierHandler.java:231)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.handleEvent(CheckpointedInputGate.java:181)
app//org.apache.flink.streaming.runtime.io.checkpointing.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:159)
app//org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:110)
app//org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:65)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:519)
app//org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1068/0x00000008409dfc40.runDefaultAction(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:203)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:804)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:753)
app//org.apache.flink.runtime.taskmanager.Task$$Lambda$1951/0x0000000840e47840.run(Unknown Source)
app//org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:948)
app//org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:927)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:741)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base#11.0.15/java.lang.Thread.run(Unknown Source)

Problem during state restore; when Flink job is submitted

We are getting the exception copied at the end of this post. It is thrown when a new Flink job is submitted and Flink tries to restore the previous state.
Environment:
Flink version: 1.10.1
State persistence: Hadoop 3.3
Zookeeper 3.5.8
Parallelism: 4
The code implements DataStream transformation functions: ProcessFunction -> KeySelector -> ProcessFunction. Inbound messages are partitioned by the key "sourceId", which appears in the exception stack trace. sourceId is a String and is unique.
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
We have overridden the "org.apache.flink.streaming.api.functions.ProcessFunction.open()" method.
Any help is appreciated.
Exception stack trace:
2021-01-19 19:59:56,934 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Process (3/4) of job c957f40043721b5cab3161991999a7ed is not in state RUNNING but DEPLOYING instead. Aborting checkpoint.
2021-01-19 19:59:57,358 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Process -> Sink: Unnamed (4/4) (b2605627c2fffc83dd412b3e7565244d) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:989)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:453)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:448)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:460)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for LegacyKeyedProcessOperator_c27dcf7b54ef6bfd6cff02ca8870b681_(4/4) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131)
... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:116)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createKeyedStateBackend(FsStateBackend.java:529)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 11 more
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:346)
at org.apache.flink.runtime.state.heap.StateTableByKeyGroupReaders.lambda$createV2PlusReader$0(StateTableByKeyGroupReaders.java:77)
at org.apache.flink.runtime.state.KeyGroupPartitioner$PartitioningResultKeyGroupReader.readMappingsInKeyGroup(KeyGroupPartitioner.java:297)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readKeyGroupStateData(HeapRestoreOperation.java:293)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readStateHandleStateData(HeapRestoreOperation.java:254)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.restore(HeapRestoreOperation.java:154)
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:114)
... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:728)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
... 24 more

flink Connection reset by peer

I have a Flink streaming job that failed with the log below. Can anyone tell me how to solve the problem?
Sometimes it fails after running for a day, and sometimes after only a few hours.
09:30:25 948 INFO (org.apache.flink.runtime.executiongraph.ExecutionGraph:1240) - TriggerWindow(TumblingProcessingTimeWindows(600000), ListStateDescriptor{serializer=org.apache.flink.api.common.typeutils.base.ListSerializer#ece0f926}, ProcessingTimeTrigger(), WindowedStream.process(WindowedStream.scala:563)) -> Filter -> Filter -> Map (40/48) (19ea993ced2b161422c345c9b633853a) switched from RUNNING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Lost connection to task manager . This indicates that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientHandler.exceptionCaught(PartitionRequestClientHandler.java:146)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
... 6 more
I ended up finding the root cause in the JobManager log:
- Closing TaskExecutor connection container_e06_1554425226316_0158_01_000024 because: Container [pid=14446,containerID=container_e06_1554425226316_0158_01_000024] is running beyond physical memory limits. Current usage: 12.5 GB of 12.5 GB physical memory used; 14.7 GB of 26.2 GB virtual memory used. Killing container.
So I increased the TaskManager memory.
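Since the TaskManager side only shows "Connection reset by peer", it can help to search the JobManager log for this kind of container kill directly. The sketch below is a small plain-Java scanner whose pattern is modeled on the exact wording of the YARN message quoted above; the class name is hypothetical, and other YARN versions may word the message or report units differently, so treat the regular expression as an assumption to adjust.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: detect YARN "beyond physical memory limits" kills in a log line. */
class ContainerKillScanner {

    // Modeled on the message quoted above; the wording and the GB unit are assumptions.
    private static final Pattern MEMORY_KILL = Pattern.compile(
            "Container \\[pid=\\d+,containerID=(\\S+?)\\] is running beyond physical memory limits\\. "
                    + "Current usage: ([\\d.]+) GB of ([\\d.]+) GB physical memory used");

    /** Returns "containerId used/limit GB" if the line reports a memory kill. */
    static Optional<String> parse(String logLine) {
        Matcher m = MEMORY_KILL.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + " " + m.group(2) + "/" + m.group(3) + " GB");
    }

    public static void main(String[] args) {
        String line = "Closing TaskExecutor connection container_e06_1554425226316_0158_01_000024 because: "
                + "Container [pid=14446,containerID=container_e06_1554425226316_0158_01_000024] "
                + "is running beyond physical memory limits. Current usage: 12.5 GB of 12.5 GB "
                + "physical memory used; 14.7 GB of 26.2 GB virtual memory used. Killing container.";
        System.out.println(parse(line).orElse("no memory kill found"));
    }
}
```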

is there any way to fix solr index

I am running a program that crawls the web and saves data into a Solr index. For mysterious reasons, the Solr server crashed, and now I have a corrupted index with no segments file, so I risk losing all the data collected over 5 days.
The error message below appears when you try to search this index. The index folder definitely has data: it contains 182 files and is 2 GB in size.
I have tried to use CheckIndex but get the same error about no segments file.
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [chase]
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.solr.core.CoreContainer.lambda$load$6(CoreContainer.java:586)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Unable to create core [chase]
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:935)
at org.apache.solr.core.CoreContainer.lambda$load$5(CoreContainer.java:558)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
... 5 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:977)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:830)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:920)
... 7 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2069)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1071)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:949)
... 9 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(MMapDirectory#/home/zqz/Work/chase/aws/data/solr/chase/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#51b2fc7e; maxCacheMB=48.0 maxMergeSizeMB=4.0)): files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:925)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:118)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:93)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:248)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:122)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2030)
... 12 more
2017-06-20 14:38:52.428 INFO (qtp475266352-16) [ ] o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 transient cores
2017-06-20 14:38:52.894 INFO (qtp475266352-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1497969532681} status=0 QTime=11
2017-06-20 14:38:52.962 INFO (qtp475266352-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json&_=1497969532684} status=0 QTime=76
The error you mentioned is caused by the missing file:
segments* (e.g. segments_3)
in the index files:
files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
That file specifies the last commit point and the last generation of segments to take into account, and apparently it is missing.
Check whether that file is there and is readable.
If it is not (for example, because the index writer was not closed properly due to the malfunction), do not despair.
Chances are that the transaction log still contains the documents you indexed, so you could just replay it and get the documents back (clean the index dir and start Solr, and it should take care of the replay).
Solr also offers backup functionality, so you may want to configure it for the future.
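The "is the file there and readable" check can also be scripted. The sketch below is plain Java using only the JDK; the class name is hypothetical, and matching on the "segments_" prefix is an assumption based on the segments_3 example and the "no segments* file found" message. It does not use the Lucene API and says nothing about whether the file's contents are valid.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Sketch: look for readable segments_N files in a Lucene/Solr index directory. */
class SegmentsFileCheck {

    /** Returns the names of readable regular files that start with "segments_". */
    static List<String> readableSegmentsFiles(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.list(indexDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(Files::isReadable)
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.startsWith("segments_"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; defaults to the current directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> found = readableSegmentsFiles(indexDir);
        System.out.println(found.isEmpty()
                ? "no readable segments_N file in " + indexDir
                : "found " + found);
    }
}
```

Run it against the data/index directory of the affected core. An empty result matches the IndexNotFoundException above.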
