Recovery from checkpoint fails with EOFException - apache-flink

This is the stacktrace:
Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: java.io.EOFException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1606)
... 3 common frames omitted
Caused by: java.lang.RuntimeException: java.io.EOFException
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
... 3 common frames omitted
Caused by: java.io.EOFException: null
at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:340)
at java.io.DataInputStream.readUTF(DataInputStream.java:589)
at java.io.DataInputStream.readUTF(DataInputStream.java:564)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeStreamStateHandleMap(MetadataV2V3SerializerBase.java:730)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeKeyedStateHandle(MetadataV2V3SerializerBase.java:408)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeSubtaskState(MetadataV2V3SerializerBase.java:269)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV3Serializer.deserializeOperatorState(MetadataV3Serializer.java:183)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV2V3SerializerBase.deserializeMetadata(MetadataV2V3SerializerBase.java:164)
at org.apache.flink.runtime.checkpoint.metadata.MetadataV3Serializer.deserialize(MetadataV3Serializer.java:89)
at org.apache.flink.runtime.checkpoint.Checkpoints.loadCheckpointMetadata(Checkpoints.java:110)
at org.apache.flink.runtime.checkpoint.Checkpoints.loadAndValidateCheckpoint(Checkpoints.java:140)
at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.restoreSavepoint(CheckpointCoordinator.java:1648)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.tryRestoreExecutionGraphFromSavepoint(DefaultExecutionGraphFactory.java:163)
at org.apache.flink.runtime.scheduler.DefaultExecutionGraphFactory.createAndRestoreExecutionGraph(DefaultExecutionGraphFactory.java:138)
at org.apache.flink.runtime.scheduler.SchedulerBase.createAndRestoreExecutionGraph(SchedulerBase.java:335)
at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:191)
at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:140)
at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:134)
at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110)
at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:346)
at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:323)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106)
at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94)
at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112)
... 4 common frames omitted
Is the checkpoint corrupt?
When I debug, it looks like Flink expects a certain number of StreamStateHandle entries to be present, but the stream ends before all of them have been read.
Does Flink have some mechanism to ignore a checkpoint when it fails to recover from it? The job fails to start because of this checkpoint. Is there a way to recover from this failure?
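For reference, this is roughly how I reproduce the failure outside the cluster, by reading the _metadata file directly. Checkpoints is an internal Flink class and its signature has changed between versions, so treat this as a sketch (the path is a placeholder):

import org.apache.flink.runtime.checkpoint.Checkpoints;
import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;

import java.io.DataInputStream;
import java.io.FileInputStream;

public class InspectCheckpointMetadata {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the checkpoint's _metadata file.
        String metadataPath = "/path/to/checkpoints/<job-id>/chk-1234/_metadata";
        try (DataInputStream in = new DataInputStream(new FileInputStream(metadataPath))) {
            // Same call the JobManager makes during restore (see the stack trace above);
            // on a truncated file it fails with the EOFException shown.
            CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
                    in, InspectCheckpointMetadata.class.getClassLoader(), metadataPath);
            System.out.println("Operator states found: " + metadata.getOperatorStates().size());
        }
    }
}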
Thanks

Related

Flink Task Manager Suddenly Crashed

Flink TM suddenly crashed after 3 months of running, with the error stack trace below.
2021-12-05 07:22:05,369 WARN org.apache.flink.runtime.taskmanager.Task [] - Task 'GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6' did not react to cancelling signal for 30 seconds, but is stuck in method:
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.notifyBufferAvailable(BufferManager.java:296)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.fireBufferAvailableNotification(LocalBufferPool.java:507)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:494)
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.recycle(LocalBufferPool.java:460)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager$AvailableBufferQueue.addExclusiveBuffer(BufferManager.java:399)
org.apache.flink.runtime.io.network.partition.consumer.BufferManager.recycle(BufferManager.java:200)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.deallocate(NetworkBuffer.java:182)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.handleRelease(AbstractReferenceCountedByteBuf.java:110)
org.apache.flink.shaded.netty4.io.netty.buffer.AbstractReferenceCountedByteBuf.release(AbstractReferenceCountedByteBuf.java:100)
org.apache.flink.runtime.io.network.buffer.NetworkBuffer.recycleBuffer(NetworkBuffer.java:156)
org.apache.flink.runtime.io.network.api.serialization.SpillingAdaptiveSpanningRecordDeserializer.getNextRecord(SpillingAdaptiveSpanningRecordDeserializer.java:95)
org.apache.flink.streaming.runtime.io.AbstractStreamTaskNetworkInput.emitNext(AbstractStreamTaskNetworkInput.java:95)
org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:66)
org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:423)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$615/1465249724.runDefaultAction(Unknown Source)
org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:204)
org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:681)
org.apache.flink.streaming.runtime.tasks.StreamTask.executeInvoke(StreamTask.java:636)
org.apache.flink.streaming.runtime.tasks.StreamTask$$Lambda$1480/994476387.run(Unknown Source)
org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:620)
org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:779)
org.apache.flink.runtime.taskmanager.Task.run(Task.java:566)
java.lang.Thread.run(Thread.java:748)
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:ALLOCATED, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Attempting to fail task externally GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 (5e34a8de7bcff882f37c073f250c2594).
2021-12-05 07:22:05,370 INFO org.apache.flink.runtime.taskmanager.Task [] - Task GlobalWindowAggregate(groupBy=[org, $f4], window=[HOP(slice_end=[$slice_end], size=[15 min], slide=[1 min])], select=[org, $f4, COUNT(distinct$0 count$0) AS $f2, COUNT(count1$1) AS window_start, start('w$) AS window_end]) -> Calc(select=[window_start, window_end, org, $f4, $f2 AS $f4_0]) (1/24)#6 is already in state CANCELING
2021-12-05 07:22:05,372 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl [] - Free slot TaskSlot(index:7, state:RELEASING, resource profile: ResourceProfile{cpuCores=2.0000000000000000, taskHeapMemory=2.656gb (2852126690 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.875gb (2013265950 bytes), networkMemory=128.000mb (134217728 bytes)}, allocationId: 2b2d5beb481130d88a1eaaa0d3be2f7d, jobId: a5ed6a11efac85d315195eb9e7534316).
2021-12-05 07:22:15,362 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396) [flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354) [flink-dist_2.12-1.13.1.jar:1.13.1]
Caused by: java.util.concurrent.TimeoutException
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582) ~[flink-dist_2.12-1.13.1.jar:1.13.1]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[?:1.8.0_232]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) ~[?:1.8.0_232]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_232]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_232]
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.TransientBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.blob.PermanentBlobCache [] - Shutting down BLOB cache
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager [] - Shutting down TaskExecutorLocalStateStoresManager.
2021-12-05 07:22:15,365 INFO org.apache.flink.runtime.filecache.FileCache [] - removed file cache directory /tmp/flink-dist-cache-9fad861a-b657-4625-a184-db126c423c2f
While debugging, I found that input and output buffer usage reached 100% on the Datadog dashboard.
I also found that the last 2 checkpoints failed with the message "Checkpoint expired before completing". The checkpoint timeout is 2 minutes.
How can I fix this issue?
Checkpoint timeouts are generally caused by either:
1. backpressure causing the checkpoint barriers to progress too slowly across the execution graph, or
2. some sort of bottleneck preventing Flink from writing fast enough to the checkpoint storage (e.g., network starvation, insufficient IOPS quota).
It looks like you are using unaligned checkpoints. This should help with point 1 above, but it could be aggravating point 2, since unaligned checkpoints increase the amount of data being checkpointed (by up to about 1 GB in your case, it looks like).
You might just want to increase the checkpoint timeout (a minimal sketch follows at the end of this answer); having checkpoints time out is almost never helpful.
But it also appears that you have significant backpressure. Figuring out what is causing that and addressing it should help. (If you can upgrade to Flink 1.13 or later, the improved backpressure monitoring will make this easier.) Perhaps you have data skew, or perhaps you need to scale up the cluster.
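For reference, a minimal sketch of raising the timeout (and toggling unaligned checkpoints) via the DataStream API; the values are placeholders, and the timeout can also be set in flink-conf.yaml via execution.checkpointing.timeout:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTuning {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);             // checkpoint every 60 s
        CheckpointConfig cfg = env.getCheckpointConfig();
        cfg.setCheckpointTimeout(10 * 60_000L);      // e.g. raise the 2 min timeout to 10 min
        cfg.enableUnalignedCheckpoints(true);        // or false, to compare checkpoint sizes
        cfg.setMaxConcurrentCheckpoints(1);
        // ... define sources, operators, and sinks, then call env.execute() as usual
    }
}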

Problem during state restore when a Flink job is submitted

We are getting the exception copied at the end of this post. The exception is thrown when a new Flink job is submitted and Flink tries to restore the previous state.
Environment:
Flink version: 1.10.1
State persistence: Hadoop 3.3
Zookeeper 3.5.8
Parallelism: 4
The code implements DataStream transformation functions: ProcessFunction -> KeySelector -> ProcessFunction. Inbound messages are partitioned by the key "sourceId", which appears in the exception stack trace. sourceId is of String type and is unique.
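For context, a simplified, hypothetical sketch of that topology (real class names are different, the first ProcessFunction is omitted for brevity, and ThingState here is only a stand-in for our actual state class):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

public class TopologySketch {

    // Hypothetical stand-in for com.contineo.ext.flink.core.ThingState.
    public static class ThingState {
        public String sourceId;
        public long count;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(4);

        env.fromElements("source-1", "source-2", "source-1")
                .keyBy(sourceId -> sourceId)                      // KeySelector on the unique String key
                .process(new ProcessFunction<String, ThingState>() {
                    private transient ValueState<ThingState> state;

                    @Override
                    public void open(Configuration parameters) {  // open() is overridden, as mentioned below
                        state = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("thing-state", ThingState.class));
                    }

                    @Override
                    public void processElement(String sourceId, Context ctx, Collector<ThingState> out)
                            throws Exception {
                        ThingState current = state.value();
                        if (current == null) {
                            current = new ThingState();
                            current.sourceId = sourceId;
                        }
                        current.count++;
                        state.update(current);
                        out.collect(current);
                    }
                })
                .print();

        env.execute("topology-sketch");
    }
}

The restore then fails with: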
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
We have overridden the org.apache.flink.streaming.api.functions.ProcessFunction.open() method.
Any help is appreciated.
Exception stack trace:
2021-01-19 19:59:56,934 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint triggering task Source: Custom Source -> Process -> Process (3/4) of job c957f40043721b5cab3161991999a7ed is not in state RUNNING but DEPLOYING instead. Aborting checkpoint.
2021-01-19 19:59:57,358 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Process -> Sink: Unnamed (4/4) (b2605627c2fffc83dd412b3e7565244d) switched from RUNNING to FAILED.
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:191)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:255)
at org.apache.flink.streaming.runtime.tasks.StreamTask.initializeStateAndOpen(StreamTask.java:989)
at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$beforeInvoke$0(StreamTask.java:453)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
at org.apache.flink.streaming.runtime.tasks.StreamTask.beforeInvoke(StreamTask.java:448)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:460)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for LegacyKeyedProcessOperator_c27dcf7b54ef6bfd6cff02ca8870b681_(4/4) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:304)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:131)
... 9 more
Caused by: org.apache.flink.runtime.state.BackendBuildingException: Failed when trying to restore heap backend
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:116)
at org.apache.flink.runtime.state.filesystem.FsStateBackend.createKeyedStateBackend(FsStateBackend.java:529)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:288)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:142)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:121)
... 11 more
Caused by: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
Serialization trace:
sourceId (com.contineo.ext.flink.core.ThingState)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:125)
at com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:528)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:761)
at org.apache.flink.api.java.typeutils.runtime.kryo.KryoSerializer.deserialize(KryoSerializer.java:346)
at org.apache.flink.runtime.state.heap.StateTableByKeyGroupReaders.lambda$createV2PlusReader$0(StateTableByKeyGroupReaders.java:77)
at org.apache.flink.runtime.state.KeyGroupPartitioner$PartitioningResultKeyGroupReader.readMappingsInKeyGroup(KeyGroupPartitioner.java:297)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readKeyGroupStateData(HeapRestoreOperation.java:293)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.readStateHandleStateData(HeapRestoreOperation.java:254)
at org.apache.flink.runtime.state.heap.HeapRestoreOperation.restore(HeapRestoreOperation.java:154)
at org.apache.flink.runtime.state.heap.HeapKeyedStateBackendBuilder.build(HeapKeyedStateBackendBuilder.java:114)
... 15 more
Caused by: java.lang.IndexOutOfBoundsException: Index: 109, Size: 10
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at com.esotericsoftware.kryo.util.MapReferenceResolver.getReadObject(MapReferenceResolver.java:42)
at com.esotericsoftware.kryo.Kryo.readReferenceOrNull(Kryo.java:805)
at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:728)
at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:113)
... 24 more

Is there any way to fix a Solr index?

I am running a program that crawls the web and saves data into a Solr index. For mysterious reasons, the Solr server crashed, and I have ended up with a corrupted index that has no segments files, so I risk losing all the data collected over 5 days....
The error message below appears when you try to search this index. The index folder definitely has data: it contains 182 files and is 2 GB in size.
I have tried to use CheckIndex, but it gives the same error about no segments files...
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: Unable to create core [chase]
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.solr.core.CoreContainer.lambda$load$6(CoreContainer.java:586)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:229)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.solr.common.SolrException: Unable to create core [chase]
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:935)
at org.apache.solr.core.CoreContainer.lambda$load$5(CoreContainer.java:558)
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
... 5 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:977)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:830)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:920)
... 7 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2069)
at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2189)
at org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1071)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:949)
... 9 more
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in LockValidatingDirectoryWrapper(NRTCachingDirectory(MMapDirectory#/home/zqz/Work/chase/aws/data/solr/chase/data/index lockFactory=org.apache.lucene.store.NativeFSLockFactory#51b2fc7e; maxCacheMB=48.0 maxMergeSizeMB=4.0)): files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:925)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:118)
at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:93)
at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:248)
at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:122)
at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2030)
... 12 more
2017-06-20 14:38:52.428 INFO (qtp475266352-16) [ ] o.a.s.c.TransientSolrCoreCacheDefault Allocating transient cache for 2147483647 transient cores
2017-06-20 14:38:52.894 INFO (qtp475266352-13) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={indexInfo=false&wt=json&_=1497969532681} status=0 QTime=11
2017-06-20 14:38:52.962 INFO (qtp475266352-20) [ ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/info/system params={wt=json&_=1497969532684} status=0 QTime=76
The error you mentioned is caused by a missing segments* file (e.g. segments_3, ...) among the index files:
files: [_fh2.fdt, _fh2.fdx, _fh2.fnm, _fh2.nvd, _fh2.nvm, _fh2.si, _fh2_Lucene50_0.doc, _fh2_Lucene50_0.pos, _fh2_Lucene50_0.tim, _fh2_Lucene50_0.tip, write.lock]
That file records the last commit point and the last generation of segments to take into account, and apparently it is missing.
Check whether that file is present and readable (a small sketch for verifying this follows at the end of this answer).
If it is not (for example because the index writer was not closed properly due to the malfunction), do not despair.
Chances are that the transaction log still contains the documents you indexed, so you could just replay it and get them back: clean the index directory, start Solr, and it should take care of the rest.
Solr also offers a backup functionality, so you may want to configure it for the future.
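If you want to check programmatically whether any commit point is readable before attempting the transaction log replay, something along these lines should work (plain Lucene API, version dependent; treat it as a sketch and adjust the path):

import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class CheckSegmentsFile {
    public static void main(String[] args) throws Exception {
        // Path to the core's index directory (placeholder, taken from the error above).
        try (Directory dir = FSDirectory.open(
                Paths.get("/home/zqz/Work/chase/aws/data/solr/chase/data/index"))) {
            // Throws IndexNotFoundException if no segments_N file exists,
            // which is exactly the error shown in the question.
            SegmentInfos infos = SegmentInfos.readLatestCommit(dir);
            System.out.println("Last commit generation: " + infos.getGeneration()
                    + ", segment count: " + infos.size());
        }
    }
}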

Error during node startup: Unable to start DSE server / Plugin activation failed / Cannot find core

I've been having these issues for quite a while already, but I ignored them initially because I could still start my nodes. However, one of these issues recently became serious enough that it now takes me a lot of tries to successfully start a node.
Issue #1: Unable to start DSE server / Plugin activation failed / Cannot find core
ERROR [main] 2015-01-28 03:30:40,058 DseDaemon.java (line 492) Unable to start DSE server.
java.lang.RuntimeException: com.datastax.bdp.plugin.PluginManager$PluginActivationException: Plugin activation failed
at com.datastax.bdp.plugin.PluginManager.activate(PluginManager.java:135)
at com.datastax.bdp.server.DseDaemon.start(DseDaemon.java:480)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:509)
at com.datastax.bdp.server.DseDaemon.main(DseDaemon.java:659)
Caused by: com.datastax.bdp.plugin.PluginManager$PluginActivationException: Plugin activation failed
at com.datastax.bdp.plugin.PluginManager.activate(PluginManager.java:284)
at com.datastax.bdp.plugin.PluginManager.activate(PluginManager.java:128)
... 3 more
Caused by: java.lang.IllegalStateException: Cannot find core: myks.mycf
at com.datastax.bdp.search.solr.core.SolrCoreResourceManager.doWaitForCore(SolrCoreResourceManager.java:742)
at com.datastax.bdp.search.solr.core.SolrCoreResourceManager.waitForCore(SolrCoreResourceManager.java:478)
at com.datastax.bdp.plugin.SolrContainerPlugin.waitForSecondaryIndexesLoading(SolrContainerPlugin.java:237)
at com.datastax.bdp.plugin.SolrContainerPlugin.onActivate(SolrContainerPlugin.java:98)
at com.datastax.bdp.plugin.PluginManager.initialize(PluginManager.java:334)
at com.datastax.bdp.plugin.PluginManager.activate(PluginManager.java:263)
... 4 more
INFO [Thread-3] 2015-01-28 03:30:40,059 DseDaemon.java (line 505) DSE shutting down...
INFO [StorageServiceShutdownHook] 2015-01-28 03:30:40,164 Gossiper.java (line 1307) Announcing shutdown
INFO [Thread-3] 2015-01-28 03:30:40,620 PluginManager.java (line 356) All plugins are stopped.
INFO [Thread-3] 2015-01-28 03:30:40,620 CassandraDaemon.java (line 463) Cassandra shutting down...
INFO [StorageServiceShutdownHook] 2015-01-28 03:30:42,165 MessagingService.java (line 701) Waiting for messaging service to quiesce
INFO [ACCEPT-/144.76.201.233] 2015-01-28 03:30:42,814 MessagingService.java (line 941) MessagingService has terminated the accept() thread
This exception started as a "mild" issue - mild because, although it prevents a node from starting when it happens, it usually took only one more try to start the affected node successfully. However, about two weeks ago, after not having restarted any of my nodes for quite a while, I discovered that I now need a lot more attempts (20+) to start a node.
From the stack trace, it looks like a timeout issue (in doWaitForCore()), but I cannot find a setting to increase the amount of time that DSE waits for a core to load during startup before giving up. The core mentioned in the stack trace is always the same, and I assume that is because it is my biggest core (~1.4 billion records) and takes the longest to load. But when I do manage to start the node, there are no signs of errors - I can query the core like any other core.
--
There are two other issues that may or may not be related to the one above. Both of them always appear during startup, and unlike the first one, they do not cause a startup failure (i.e. they also appear when a node starts successfully).
Issue #2: Invalid Number: static
ERROR [searcherExecutor-67-thread-1] 2015-01-28 04:26:49,691 SolrException.java (line 124) org.apache.solr.common.SolrException: Invalid Number: static
at org.apache.solr.schema.TrieField.readableToIndexed(TrieField.java:396)
at org.apache.solr.schema.FieldType.getFieldQuery(FieldType.java:697)
at org.apache.solr.schema.TrieField.getFieldQuery(TrieField.java:343)
at org.apache.solr.parser.SolrQueryParserBase.getFieldQuery(SolrQueryParserBase.java:741)
at org.apache.solr.parser.SolrQueryParserBase.handleBareTokenQuery(SolrQueryParserBase.java:545)
at org.apache.solr.parser.QueryParser.Term(QueryParser.java:300)
at org.apache.solr.parser.QueryParser.Clause(QueryParser.java:186)
at org.apache.solr.parser.QueryParser.Query(QueryParser.java:108)
at org.apache.solr.parser.QueryParser.TopLevelQuery(QueryParser.java:97)
at org.apache.solr.parser.SolrQueryParserBase.parse(SolrQueryParserBase.java:153)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParser.java:50)
at org.apache.solr.search.QParser.getQuery(QParser.java:143)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:135)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:183)
I looked at the data that I imported and I couldn't find a supposedly numeric value that was incorrectly supplied as "static". In the Java application that I wrote to convert CSVs to SSTables, I cast all numeric values to int/long/double depending on the field type, so I honestly don't think it has anything to do with my data.
Issue #3: Could not getStatistics on info bean com.datastax.bdp.search.solr.FilterCacheMBean
WARN [SolrSecondaryIndex myks.mycf2 index initializer.] 2015-01-28 04:26:51,770 JmxMonitoredMap.java (line 256) Could not getStatistics on info bean com.datastax.bdp.search.solr.FilterCacheMBean
java.lang.RuntimeException: java.lang.ClassCastException: org.apache.lucene.search.FieldCache$CreationPlaceholder cannot be cast to org.apache.solr.search.SolrCache
at com.datastax.bdp.search.solr.FilterCacheMBean.getStatistics(FilterCacheMBean.java:185)
at org.apache.solr.core.JmxMonitoredMap$SolrDynamicMBean.getMBeanInfo(JmxMonitoredMap.java:236)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getNewMBeanClassName(DefaultMBeanServerInterceptor.java:333)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:319)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:140)
at org.apache.solr.core.JmxMonitoredMap.put(JmxMonitoredMap.java:51)
at com.datastax.bdp.search.solr.core.CassandraCoreContainer.registerExtraMBeans(CassandraCoreContainer.java:679)
at com.datastax.bdp.search.solr.core.CassandraCoreContainer.register(CassandraCoreContainer.java:427)
at com.datastax.bdp.search.solr.core.CassandraCoreContainer.doLoad(CassandraCoreContainer.java:757)
at com.datastax.bdp.search.solr.core.CassandraCoreContainer.load(CassandraCoreContainer.java:162)
at com.datastax.bdp.search.solr.AbstractSolrSecondaryIndex$2.run(AbstractSolrSecondaryIndex.java:882)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.lucene.search.FieldCache$CreationPlaceholder cannot be cast to org.apache.solr.search.SolrCache
at com.datastax.bdp.search.solr.FilterCacheMBean.getStatistics(FilterCacheMBean.java:174)
... 16 more
I have absolutely no idea what this is.
--
Has anyone encountered these errors/exceptions/warnings before? What did you do?
Issue #1: The maximum waiting time to load a core was hard-coded at 1 minute. So your assumption is right: a very large core, or hundreds of cores, could prevent the node from starting due to the excessive time needed to load that particular core. In the next patch releases (4.5.6, 4.6.1) we address this by introducing a new option, load_max_time_per_core, in dse.yaml. This option allows you to increase the maximum waiting time for core loading, starting at 1 minute; for 500 cores, for example, you would need to increase load_max_time_per_core to about 3 minutes (see the example below).
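For example, once you are on a release that includes the option, you would add something like this to dse.yaml (the value is only an illustration; check your version's dse.yaml comments for the exact unit, which I believe is minutes):

# dse.yaml
load_max_time_per_core: 10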
Issue #2: Unfortunately, I don't know what could be causing this. We would need further info about this to see why it's happening.
Issue #3: We are currently investigating what this could be.
Regarding issue #2, are you sure you don't have a QuerySenderListener with a wrong warmup query in your solrconfig?

Solr error when doing full-import of 250000 rows: org.apache.solr.common.SolrException; null:org.eclipse.jetty.io.EofException

I am using Solr 4.6.0 with Jetty on Windows 7 Enterprise, with a max heap of 2 GB. I can do a full-import of 200,000 records properly from the Solr Admin UI, but as soon as I increase this to 250,000 records, it starts giving me the error below:
webapp=/solr path=/dataimport params={optimize=false&clean=false&indent=true&commit=true&verbose=true&entity=files&command=full-import&debug=true&wt=json&rows=250000} {add=[8065121, 8065126, 8065128, 8065146, 8065963, 7838189, 7838186, 8065155, 8065174, 8065179, ... (250001 adds)],commit=} 0 2693420
org.apache.solr.common.SolrException; null:org.eclipse.jetty.io.EofException
at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
at org.eclipse.jetty.http.AbstractGenerator.blockForOutput(AbstractGenerator.java:507)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:170)
at org.eclipse.jetty.server.HttpOutput.write(HttpOutput.java:107)
at sun.nio.cs.StreamEncoder.writeBytes(Unknown Source)
at su
Caused by: java.net.SocketException: Software caused connection abort: socket write error at java.net.SocketOutputStream.socketWrite0(Native Method)
at j......
org.apache.solr.common.SolrException;null:org.eclipse.jetty.io.EofException at org.eclipse.jetty.http.HttpGenerator.flushBuffer(HttpGenerator.java:914)
org.eclipse.jetty.servlet.ServletHandler; /solr/dihdb/dataimport
java.lang.IllegalStateException: Committed
at org.eclipse.jetty.server.Response.resetBuffer(Response.java:1144)
I have changed example/etc/jetty.xml to set maxIdleTime=3500000.
I changed example/etc/webdefault.xml to set session-timeout=720.
I still keep getting the error above.
TIA,
Vijay
I changed to -Xmx5120M, and that seems to have fixed the issue for 500K and 1 million records. Lack of memory, in essence, was the cause behind this misleading error.
I also tried the values 100000 and 1800000 for DataImportHandler.
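For anyone hitting the same thing: with the Solr 4.x example setup, the heap is passed to the Jetty launcher, e.g. (the exact start script and flags depend on how you run Solr):

java -Xmx5120m -jar start.jar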
